Abstract

Concave regularization methods provide natural procedures for sparse recovery. However,
they are difficult to analyze in the high dimensional setting. Only recently have a few sparse
recovery results been established for some specific local solutions obtained via specialized numerical
procedures. Still, the fundamental relationship between these solutions, such as whether they are
identical or how they relate to the global minimizer of the underlying nonconvex formulation, is
unknown. The current paper fills this conceptual gap by presenting a general theoretical framework
showing that under appropriate conditions, the global solution of nonconvex regularization
leads to desirable recovery performance; moreover, under suitable conditions, the global solution
corresponds to the unique sparse local solution, which can be obtained via different numerical
procedures. Under this unified framework, we present an overview of existing results and discuss
their connections. The unified view of this work leads to a more satisfactory treatment of
concave high dimensional sparse estimation procedures, and serves as a guideline for developing
further numerical procedures for concave regularization.

1 Introduction 

Let $X$ be an $n \times p$ design matrix and $y \in \mathbb{R}^n$ a response vector satisfying

$$y = X\beta + \varepsilon, \qquad (1)$$

where $\beta \in \mathbb{R}^p$ is a target vector of regression coefficients and $\varepsilon \in \mathbb{R}^n$ is a noise vector. This
paper concerns the estimation of the value of $X\beta$, that of $\beta$, or its support set $\mathrm{supp}(\beta)$, where
$\mathrm{supp}(b) := \{j : b_j \neq 0\}$ for any vector $b = (b_1, \ldots, b_p)^T \in \mathbb{R}^p$.

We are interested in the high-dimensional case where $n$ and $p$ are both allowed to diverge,
including the case of $p \gg n$. We assume that the target vector $\beta$ is sparse in some sense, such as the
$\ell_0$ sparsity $|\mathrm{supp}(\beta)| \le c_0 n/\ln p$, or the capped-$\ell_1$ sparsity $\sum_{j=1}^p \min(1, |\beta_j/\sigma|\sqrt{n/\ln p}) \le c_0 n/\ln p$,
where $\sigma$ is a certain noise level and $c_0$ is a fixed small constant. While we are mainly interested in
the Gaussian noise $\varepsilon \sim N(0, \sigma^2 I_{n \times n})$ or zero-mean sub-Gaussian noise, the specific noise properties
required in our analysis will be provided later.

We consider the following class of penalized least squares estimators

$$\hat\beta := \arg\min_{b \in \mathbb{R}^p} L_\lambda(b), \qquad L_\lambda(b) := \frac{1}{2n}\|Xb - y\|_2^2 + \sum_{j=1}^p \rho(b_j; \lambda), \qquad (2)$$

where $b = (b_1, \ldots, b_p)^T$ and $\rho(t; \lambda)$ is a scalar regularization function with a certain regularization
parameter $\lambda > 0$. As an example, we may let $\rho(t; \lambda) = \lambda^2 I(t \neq 0)/2$, which corresponds to the
$\ell_0$ regularization problem. Here $I(\cdot)$ denotes the $\{0, 1\}$ valued indicator function. Since $I(t \neq 0)$ is
a discontinuous function at $t = 0$, the corresponding $\ell_0$ optimization problem may be difficult to
solve. In practice, one also looks at continuous regularizers that approximate $\ell_0$ regularization,
such as $\rho(t; \lambda) = \min(\lambda^2/2, \lambda|t|)$. As we will show in the paper, sparse local solutions of such
regularizers can be obtained using standard numerical procedures (such as gradient descent), and
they are closely related to the global solution of (2).
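
To make the objective concrete, the following minimal Python sketch evaluates the penalized loss in (2) for the two example penalties named above. The function names and the toy data are hypothetical and chosen only for illustration; the toy design has columns of norm $\sqrt{n}$.

```python
def l0_penalty(t, lam):
    """rho(t; lam) = lam^2 * I(t != 0) / 2, the l0 penalty."""
    return 0.5 * lam ** 2 if t != 0 else 0.0


def capped_l1_penalty(t, lam):
    """rho(t; lam) = min(lam^2 / 2, lam * |t|), a continuous surrogate."""
    return min(0.5 * lam ** 2, lam * abs(t))


def penalized_loss(X, y, b, lam, penalty):
    """L_lambda(b) = ||X b - y||_2^2 / (2 n) + sum_j rho(b_j; lam)."""
    n = len(y)
    residual = [sum(xij * bj for xij, bj in zip(row, b)) - yi
                for row, yi in zip(X, y)]
    fit = sum(r * r for r in residual) / (2.0 * n)
    return fit + sum(penalty(bj, lam) for bj in b)


# Toy data (hypothetical): n = 2, p = 2, columns have norm sqrt(n).
X = [[1.0, 1.0], [1.0, -1.0]]
y = [2.1, 1.9]
b = [2.0, 0.0]
loss_l0 = penalized_loss(X, y, b, 1.0, l0_penalty)
loss_capped = penalized_loss(X, y, b, 1.0, capped_l1_penalty)
```

For a coefficient larger than $\lambda/2$ in absolute value the two penalties agree, so both losses coincide on this toy vector; they differ only for small nonzero coefficients, where the capped penalty is continuous.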

2 Survey of Existing Concave Regularization Results 

While this survey is not intended to be comprehensive, it presents a high-level view of some important
contributions to the area of concave regularization. We will discuss both methodological and
analytical contributions.

2.1 Terminologies 

The following notation is used throughout the paper. For any dimension $d$, bold face letters denote
vectors and normal face their elements, e.g. $v = (v_1, \ldots, v_d)^T$, with $\mathrm{supp}(v)$ being its support
$\{j : v_j \neq 0\} \cap \{1, \ldots, d\}$. Capital bold face letters denote matrices, e.g. $X$ and $\Sigma$. The $\ell_q$ "norm"
of $v$ is $\|v\|_q := (\sum_{j=1}^d |v_j|^q)^{1/q}$ for $0 < q < \infty$, with the usual extension $\|v\|_0 := |\mathrm{supp}(v)|$ and
$\|v\|_\infty := \max_{j \le d} |v_j|$. Design vectors, or columns of $X$, are denoted by $x_j$. For simplicity, we
assume throughout the paper that the columns of $X$ are normalized to

$$\|x_j\|_2 = \sqrt{n}.$$

This condition is not essential but it simplifies some notations. For variable sets $A \subset \{1, \ldots, p\}$,
$X_A = (x_j, j \in A)$ denotes the restriction of columns of $X$ to $A$, and $b_A = (b_j, j \in A)^T$ the
restriction of vector $b \in \mathbb{R}^p$ to $A$. The maximum and minimum eigenvalues of matrix $\Sigma$ are
denoted by $\lambda_{\max}(\Sigma)$ and $\lambda_{\min}(\Sigma)$.

Definition 1. The following terminologies will be used to simplify discussion.

(a) The $\ell_0$ sparsity of $\beta$ means $\|\beta\|_0 \le s^*$. To allow $\beta$ with many more components near zero, a
weaker notion of capped-$\ell_1$ sparsity is $\sum_j \min(1, |\beta_j|/\lambda_{univ}) \le s^*$, where $\lambda_{univ} = \sigma\sqrt{(2/n)\ln p}$
is the universal threshold level for a certain noise level $\sigma$.

(b) A regularity condition on $X$ is a class $\mathscr{X}$ of (column-normalized) matrices that match a
sparsity condition on $\beta$ to guarantee a desired result. Such a regularity condition can be
stated as $X \in \mathscr{X}^{n \times p}_{s^*}$, with matrix classes $\mathscr{X}^{n \times p}_{s^*} \subset \mathbb{R}^{n \times p}$ indexed by $(n, p, s^*)$, where $s^*$ is the
sparsity level of the matching regularity condition on $\beta$. Such a condition on $X$ is called an
$\ell_2$ regularity condition (or simply $\ell_2$ regular) if the matrix classes $\mathscr{X}^{n \times p}_{s^*}$ are sufficiently large
to satisfy the following condition:

- Given any $\kappa_0 \ge 1$, there exists a constant $c_0 > 0$ such that for all $0 < \delta < 1/e$

$$\inf_{\mu, n, p, s^*} \Big\{ \mu\big(Q^{-1}(\mathscr{X}^{n \times p}_{s^*})\big) : \mu \in \mathscr{M}^{n \times p}_{\kappa_0},\ (s^*/n)\ln(p/\delta) \le c_0,\ \min(n, p, s^*) \ge 1 \Big\} \ge 1 - \delta,$$

where $\mathscr{M}^{n \times p}_{\kappa_0}$ is the set of probability measures in $\mathbb{R}^{n \times p}$ under which the rows of $\mathbb{R}^{n \times p}$ are
iid $N(0, \Sigma)$ for some $\Sigma$ with $\lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma) \le \kappa_0$ and identical diagonal elements,
and $Q$ is the column normalization mapping given by $Q(X) = (x_j n^{1/2}/\|x_j\|_2, j \le p)$.

(c) An estimator $\hat\beta$ is selection consistent if $\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta)$, and sign-consistent if $\mathrm{sgn}(\hat\beta) =
\mathrm{sgn}(\beta)$, with the convention $\mathrm{sgn}(0) = 0$ for the sign function.

(d) An estimator has the oracle property if

$$\hat\beta = \hat\beta^o, \qquad \hat\beta^o_S = (X_S^T X_S)^{-1} X_S^T y, \qquad \mathrm{supp}(\hat\beta^o) \subseteq S, \qquad (3)$$

where $S = \mathrm{supp}(\beta)$. The estimator $\hat\beta^o$ is called the oracle LSE.
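
The difference between the two sparsity notions in Definition 1(a) can be illustrated numerically. The sketch below is only an illustration under assumed values: the function names and the example vector (two strong signals plus many tiny coefficients) are hypothetical.

```python
import math


def universal_threshold(sigma, n, p):
    """lambda_univ = sigma * sqrt((2 / n) * ln p)."""
    return sigma * math.sqrt(2.0 * math.log(p) / n)


def l0_sparsity(beta):
    """||beta||_0: number of nonzero coefficients."""
    return sum(1 for bj in beta if bj != 0)


def capped_l1_sparsity(beta, lam_univ):
    """sum_j min(1, |beta_j| / lam_univ)."""
    return sum(min(1.0, abs(bj) / lam_univ) for bj in beta)


# Hypothetical example: two strong signals plus many tiny coefficients.
n, p, sigma = 100, 50, 1.0
beta = [3.0, -2.0] + [0.001] * (p - 2)
lam = universal_threshold(sigma, n, p)
s_l0 = l0_sparsity(beta)                   # every entry is nonzero
s_capped = capped_l1_sparsity(beta, lam)   # dominated by the two strong signals
```

Here the $\ell_0$ count treats every tiny coefficient as a full signal, while the capped-$\ell_1$ measure stays close to the number of coefficients above the universal threshold.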

Remark 1. The standard regularity condition for the classical low-dimensional statistical scenario
of $p < n$ is that the rank of $X$ is $p$. Definition 1(b) generalizes this classical regularity condition
to allow $p \gg n$. We may explicitly include the classical situation into the definition of $\ell_2$ regularity
(that is, require $\mathscr{X}^{n \times p}_{s^*}$ to contain all column-normalized $n \times p$ matrices of rank $p$) if we confine our
discussion to fixed sample conditions. See the last paragraph of this subsection for more discussion.

Remark 2. If we consider a sequence of models in (1) with $n \to \infty$, then asymptotically an
estimator has the oracle property (allowing statistical inference for all linear functionals of $\beta$) if

$$\sup_a P\{|a^T(\hat\beta - \hat\beta^o)|^2 > \epsilon\,\mathrm{Var}(a^T\hat\beta^o)\} = o(1) \quad \forall \epsilon > 0,$$

and this is a weaker requirement than (3) because it allows $\hat\beta$ to converge only asymptotically to
$\hat\beta^o$. While this work focuses on the stronger requirement (3) that is easier to interpret in the finite
sample situation, the weaker definition has been used in some previous asymptotic analysis.

For $0 < r \le 1$, the capped-$\ell_1$ sparsity condition holds for all vectors with $\|\beta\|_r \le R$ as long as

$$(R/\lambda_{univ})^r \le s^*.$$
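
The reasoning behind this claim is short enough to spell out. Using the elementary inequality $\min(1, x) \le x^r$ for $x \ge 0$ and $0 < r \le 1$, and the notation of Definition 1(a), one obtains the following chain (a sketch of the omitted step, not a statement taken from the original text):

```latex
\sum_{j=1}^p \min\!\left(1, \frac{|\beta_j|}{\lambda_{univ}}\right)
\le \sum_{j=1}^p \left(\frac{|\beta_j|}{\lambda_{univ}}\right)^{r}
= \frac{\|\beta\|_r^r}{\lambda_{univ}^r}
\le \left(\frac{R}{\lambda_{univ}}\right)^{r}
\le s^*.
```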

In the classical statistical scenario of $p < n$, a standard regularity condition on the design matrix
$X$ is that the rank of $X$ is $p$. Definition 1(b) generalizes this classical regularity condition to $p \gg n$.
For example, $\inf_{|A| \le 3s^*}\{\mathrm{rank}(X_A)/|A|\} = 1$ is $\ell_2$ regular. The $\ell_2$ notion allows an assessment of the
strength of assumptions on $X$ by random matrix theory without repeating technical statements of
more specialized conditions. Moreover, since the $\ell_2$ criterion is required to hold for $(s^*/n)\ln(p/\delta) \le
c_0$, results based on an $\ell_2$ regularity condition on $X$ and a matching sparsity condition on $\beta$ must apply
to the case of large $p$, including $p \gg n$. Since regularity conditions on $\beta$ and $X$ must work
together to guarantee their consequences, for simplicity the sparsity level $s^*$ for $\ell_2$ regularity is
always understood in the sequel as the $\ell_0$ or capped-$\ell_1$ sparsity level of $\beta$ given in Definition 1(a).

Throughout the paper, $X$ and $\beta$ in (1) are treated as deterministic. Since the $\ell_2$ criterion is
about the size of $\mathscr{X}^{n \times p}_{s^*}$, it does not imply randomness of $X$. In fact, since the $\ell_2$ criterion is required
to hold simultaneously for all $\mu \in \mathscr{M}^{n \times p}_{\kappa_0}$ with the same $\mathscr{X}^{n \times p}_{s^*}$ in $\mathbb{R}^{n \times p}$, an $\ell_2$ regularity condition
is weaker than the condition of a random $X$ with distribution $\mu$ for a fixed $\mu \in \mathscr{M}^{n \times p}_{\kappa_0}$ and
typically requires a more explicit specification of the matrix class $\mathscr{X}^{n \times p}_{s^*}$. We call the criterion $\ell_2$,
since it depends only on the range of the spectrum (the smallest and largest eigenvalues) of $\Sigma$.

The rest of the subsection discusses different forms of $\ell_2$ conditions. Since the meaning of
sparsity level is always clear in its proper context, for simplicity we will discuss design matrix
conditions without explicitly referring to their sparsity levels.

In what follows, we will briefly explain some $\ell_2$-regularity conditions that have appeared in the literature.
Related conditions were first introduced in the compressive sensing literature to analyze $\ell_1$-
regularized recovery of a sparse $\beta$ from its random projection $X\beta$ with iid $N(0, 1)$ entries in $X$.
The most well-known of such conditions is the restricted isometry condition (RIP) introduced in
[10]. In order to explain RIP, we first define the lower and upper sparse eigenvalues as

$$\kappa_-(m) := \min_{\|u\|_0 \le m;\ \|u\|_2 = 1} \|Xu\|_2^2/n, \qquad \kappa_+(m) := \max_{\|u\|_0 \le m;\ \|u\|_2 = 1} \|Xu\|_2^2/n. \qquad (4)$$

RIP requires $\delta_k + \delta_{2k} + \delta_{3k} < 1$ with $k = \|\beta\|_0$ and $\delta_m = \max\{\kappa_+(m) - 1, 1 - \kappa_-(m)\}$. A
related condition is the uniform uncertainty principle (UUP) $\delta_{2k} + \theta_{2k,k} < 1$ in [H], where $\theta_{k,\ell} =
\max (X_A v_A)^T (X_B u_B)/n$ with $A \cap B = \emptyset$, $|A| = k$, $|B| = \ell$, and $\|v\|_2 = \|u\|_2 = 1$. For $\ell_1$ regularized
estimators, bounds of the optimal order for the $\ell_2$-norm estimation error $\|\hat\beta - \beta\|_2$ can be obtained
under RIP, UUP, as well as their improvement $\delta_{1.25k} + \theta_{1.25k,k} < 1$ in [7]. While the conditions
for RIP and UUP are specialized to hold for random designs with covariance matrix $\Sigma = I_{p \times p}$,
related conditions using sparse eigenvalues can be defined to fulfill the $\ell_2$ criterion in Definition
1(b); for example the sparse Riesz condition (SRC) $\|\beta\|_0 < \max_m 2m/\{1 + \kappa_+(m)/\kappa_-(m)\}$ in
[431 S2]) and some other extensions in |44| 1-4=1 j . These more general conditions are $\ell_2$ regularity
conditions by our definition, and they lead to $\ell_2$-norm estimation error bounds of the optimal order
for $\ell_1$ regularized estimators. Additional refinements were introduced in the literature, such as the
restricted eigenvalue of [3] [22],

$$\mathrm{RE}_2 = \mathrm{RE}_2(\xi, S) := \inf\big\{\|Xu\|_2/(\|u\|_2 n^{1/2}) : \|u_{S^c}\|_1 < \xi\|u_S\|_1\big\},$$

where $S = \mathrm{supp}(\beta)$, and the compatibility factor of [37] [55],

$$\mathrm{RE}_1 = \mathrm{RE}_1(\xi, S) := \inf\big\{|S|^{1/2}\|Xu\|_2/(\|u_S\|_1 n^{1/2}) : \|u_{S^c}\|_1 < \xi\|u_S\|_1\big\}.$$

It can be shown that $\mathrm{RE}_1 \ge \mathrm{RE}_2$ and appropriate sparse eigenvalues imply $\mathrm{RE}_2 > 0$. Therefore both
$\mathrm{RE}_2$ and $\mathrm{RE}_1$ are $\ell_2$ regularity conditions. Moreover, for $\ell_1$ regularized estimators, $\mathrm{RE}_1$ provides
$\ell_1$-norm estimation and $\ell_2$-norm prediction error bounds of optimal order, and $\mathrm{RE}_2$ provides $\ell_2$-norm
estimation bounds of optimal order.

This paper employs an even weaker condition involving a restricted invertibility factor $\mathrm{RIF}_q$ in
(14), which is related to the cone invertibility factor $\mathrm{CIF}_q$ ($q \ge 1$) defined below:

$$\mathrm{CIF}_q = \mathrm{CIF}_q(\xi, S) := \inf\left\{\frac{|S|^{1/q}\|X^T X u\|_\infty}{n\|u\|_q} : \|u_{S^c}\|_1 < \xi\|u_S\|_1\right\}. \qquad (5)$$

The quantity $\mathrm{CIF}_q$ and its sign-restricted version have appeared in [41], where invertibility
factor-based $\ell_q$ error bounds of the form (19) below have been proven to sharpen earlier results for the
Lasso and Dantzig selector [9j 03j [3j 0U [39] when $q \in [1, 2]$. Such error bounds are of optimal order
[lU [30]. Of special interest are $q \in [1, 2]$ for which the condition $\mathrm{CIF}_q > 0$ on $X$ is $\ell_2$ regular and

CIF&S) > (1 + 02 , CIFaCe.S) > ^ > + . (6) 

Thus, $\mathrm{CIF}_q > 0$ is an $\ell_2$ regularity condition for $q \in [1, 2]$.

A main advantage of using invertibility factors is that for $q > 2$, invertibility factors still yield
$\ell_q$ error bounds of optimal order which match results in [44] [41]. However, the sparse and
restricted eigenvalues do not yield error bounds of optimal order due to the unboundedness of
max|| u || 2=1 ||'U5|| (? ||'ii,s||i/|S| 1 / 9 in 

We shall point out that different $\ell_2$ regularity conditions are typically not equivalent since
different norms are involved in the definitions of different quantities. For instance, in a specific
example given in [31 [39], $\mathrm{RE}_1$ and $\mathrm{CIF}_2$, uniformly bounded away from zero, yield $\ell_1$ and $\ell_2$
error bounds of optimal order respectively, but $\mathrm{RE}_2$ does not.

In the above discussion, we focus on fixed sample conditions like $\mathrm{RE}_2 > 0$ and $\mathrm{CIF}_2 > 0$, which
hold when $\mathrm{rank}(X) = p$. These conditions can be directly seen as $\ell_2$ regular from their existing
lower bounds for $p > n$ such as those in (3] [41]. The optimality of the order of the error bounds
based on such quantities can also be stated as $\ell_2$ regularity conditions by comparing them with
sparse eigenvalues. See Remark 7 for more discussion.

2.2 Previous Results 

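Among concave penalties, the $\ell_1$ penalty is the only convex one. Thus, the Lasso ($\ell_1$ regularization)
is a special case of (2) with $\rho(t; \lambda) = \lambda|t|$ [35] fTT) :

$$\hat\beta^{(\ell_1)} = \arg\min_{b \in \mathbb{R}^p} \left\{\frac{1}{2n}\|Xb - y\|_2^2 + \lambda\|b\|_1\right\}. \qquad (7)$$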



As a function of $\lambda$, the Lasso path $\hat\beta = \hat\beta(\lambda)$ matches that of $\ell_1$ constrained quadratic
programming. One may use the homotopy/Lars algorithm to compute the complete Lasso path for
$\lambda \in [0, \infty)$ |28[ [29l [T3] or simply use a standard convex optimization algorithm to compute the
Lasso solution for a finite set of $\lambda$. The Dantzig selector, proposed in [9], is an $\ell_1$-minimization
method related to the Lasso, which solves

$$\hat\beta = \arg\min_{b \in \mathbb{R}^p} \|b\|_1 \quad \text{subject to} \quad \|X^T(Xb - y)\|_\infty \le \lambda.$$

It has analytical properties similar to those of the Lasso, but can be computed by linear programming
rather than quadratic programming as in the Lasso. Analytic properties of the Lasso or Dantzig selector
have been studied in |3H HZl 12^1 E51 SQl El El IBB1 1^1 IS 1221 IS1 ISSl 171 SI]. A basic story is
described in the following two paragraphs.
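
As one example of a "standard convex optimization algorithm" for (7), the sketch below uses cyclic coordinate descent with soft-thresholding. This is a swapped-in illustrative technique, not the homotopy/Lars algorithm mentioned above, and it assumes the column normalization $\|x_j\|_2 = \sqrt{n}$ used in this paper. The function names and the toy data are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize ||X b - y||_2^2 / (2 n) + lam * ||b||_1 by coordinate descent.

    Assumes every column of X has squared norm n, so that each coordinate
    update reduces to a single soft-thresholding step.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    r = list(y)  # residual y - X b, kept up to date
    for _ in range(n_sweeps):
        for j in range(p):
            # correlation of column j with the partial residual (b_j removed)
            z = sum(X[i][j] * r[i] for i in range(n)) / n + b[j]
            new_bj = soft_threshold(z, lam)
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


# Toy orthogonal design (hypothetical): n = 4, p = 2, columns of norm sqrt(n).
X = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
y = [2.1, 1.9, 2.1, 1.9]
b_lasso = lasso_cd(X, y, lam=0.5)
```

On this toy design the least squares coefficients are 2.0 and 0.1; the Lasso sets the second to zero and shrinks the first by $\lambda$, which is a small instance of the Lasso bias discussed later in this subsection.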

Under various $\ell_2$ regularity conditions on $X$ and the $\ell_0$ sparsity condition on $\beta$, the Lasso and
Dantzig selector control the estimation errors and the dimension of the selected model in the sense

$$\frac{\|X\hat\beta - X\beta\|_2^2}{\sigma^2 \ln p} + \frac{\|\hat\beta - \beta\|_q^q}{\{(\sigma^2/n)\ln p\}^{q/2}} + \|\hat\beta\|_0 = O_P(s^*), \qquad 1 \le q \le 2, \qquad (8)$$

[91 [38l [43l El [221 [441 SH [5] . Compared with the oracle $\hat\beta^o$ in (3), the estimation loss of $\hat\beta$ is inflated
by a factor of no greater order than $\sqrt{\ln p}$, and the size of the selected model is of the same order
as the true one. When $\ln(p/n) \asymp \ln p$, it has been proved in [HI [30] that (8) matches the order of
the risk of a Bayes estimator for a class of (weak) signals close to zero, so that the order of this
loss inflation factor $\sqrt{\ln p}$ is the smallest possible without further assumption on the strength of
the signal $\beta$. This inflation factor can be viewed as the cost of not knowing $\mathrm{supp}(\beta)$. Nevertheless,
when $\beta$ is strong (in the sense that its minimum nonzero coefficient is not close to zero), it is
possible to achieve the oracle property, which removes the inflation factor. However, even in such
cases the logarithmic inflation is still present for the Lasso solution, and it is generally referred
to as the Lasso bias; it means that the Lasso does not have the oracle property even when the
signal is strong [14] [15]. A nonconvex penalty can be used to remedy this issue. For the Lasso and
Dantzig selector, extensions of (8) have been established for capped-$\ell_1$ sparse $\beta$ [43] [44] [4T| and for
$2 < q \le \infty$ under certain $\ell_q$ regularity conditions on $X$ [44] [41]. Error bounds of type (8) have been
used in the analysis of the joint estimation of the noise level $\sigma^* := \|\varepsilon\|_2/\sqrt{n}$ and $\beta$ [32 | 12] 133 ] 134].
For example, the scaled Lasso 

{/3, a} = argmin {\\y - Xb\\ 2 /{2na) + 2v / (lnp)/n||6||i} 

{b,a} 

provides $|\hat\sigma/\sigma^* - 1| = O_P(|S|(\ln p)/n)$ along with (8) under $\ell_2$ regularity conditions [34].
For variable selection, the Lasso is sign consistent in the event 

sgn^Vsgnta), minimi >0;A, A > 'JSE. E W) , (9) 

where $\theta_1^* = \|(X_S^T X_S/n)^{-1}\mathrm{sgn}(\beta_S)\|_\infty$, $\theta_2^* = \|X_{S^c}^T X_S (X_S^T X_S)^{-1}\mathrm{sgn}(\beta_S)\|_\infty$, $S = \mathrm{supp}(\beta)$, and
$\hat\beta^o$ is the oracle estimator in (3) (Ml EH EE HO] . Since $\|\hat\beta^o - \beta\|_\infty = O_P(1)\sqrt{(\ln\|\beta\|_0)/n} = o_P(\lambda)$
under mild conditions, $\theta_1^*$ and $\theta_2^*$ are key quantities in (9). For fixed $\kappa_0 < 1$, $\theta_2^* \le \kappa_0$ is called
the neighborhood stability/strong irrepresentable condition [26] [48]. For $X$ with iid $N(0, \Sigma)$ rows
and given $S$, $\theta_1^*$ and $\theta_2^*$ are within a small fraction of their population versions with $\Sigma$ in place of
$X^T X/n$ [10]. For random $\beta$ with $\|\beta\|_0 \asymp n/\{\|XX^T/p\|_2 \ln p\}$ and uniformly distributed $\mathrm{sgn}(\beta)$
given $\|\beta\|_0$, $\theta_1^* \le 2$ and $\theta_2^* \le 1 - 1/\sqrt{2}$ with large probability under the incoherence condition
$\max_{j \neq k} |x_j^T x_k/n| \lesssim 1/(\ln p)$ [8]. It is worth mentioning that neither the incoherence condition nor
the strong irrepresentable condition is £2 regular: in fact they may both fail with 0* x IS 1 ] 1 / 2 and 
minj g s I > 6*[\ even in the classical setting of X being rank p. Since Q\ — 1 i s necessary for 
the selection consistency of the Lasso under the first two conditions of (9) [36] [40], this means that
the Lasso is not model selection consistent under $\ell_2$ regularity conditions. In order to achieve model
selection consistency under $\ell_2$ regularity, we have to employ a nonconvex penalty in (2).

For sparse estimation, the $\ell_0$ penalized LSE corresponds to the choice of $\rho(t; \lambda) = (\lambda^2/2)I(t \neq 0)$ in
(2), and it was introduced in the literature [H 1241 131] before the Lasso. Formally,

$$\hat\beta^{(\ell_0)} = \arg\min_{b \in \mathbb{R}^p} \left\{\frac{1}{2n}\|Xb - y\|_2^2 + \frac{\lambda^2}{2}\|b\|_0\right\}. \qquad (10)$$

This method is important for sparse recovery because with the Gaussian noise model $\varepsilon \sim N(0, \sigma^2 I)$,
uniform distribution on the support set, and flat distribution of $\beta$ within the support, it is a Bayesian
procedure for support set recovery. However, this penalty is not easy to work with numerically
because it is discontinuous at zero. The Lasso can be viewed as a convex surrogate of (10), but
it does not achieve model selection consistency under $\ell_2$ regularity, nor does it have the oracle
property when the signal is uniformly strong.

Continuous concave penalties other than the Lasso have been introduced to remedy these problems.
These concave functions approximate the $\ell_0$ penalty better than the Lasso, and thus can remove the Lasso
bias problem. Most concave penalties are interpolations between the Lasso and the $\ell_0$ penalty.
For example the $\ell_\alpha$ (bridge) penalty [16] with $0 < \alpha < 1$ is equivalent to the choice of $\rho(t; \lambda) =
|t|^\alpha \lambda^{2-\alpha}\{2(1 - \alpha)\}^{1-\alpha}/(2 - \alpha)^{2-\alpha}$ in (2). While the bridge penalty is continuous, its derivative is $\infty$
at $t = 0$, which may still cause numerical problems. In fact, the $\infty$ derivative value means that $\hat\beta = 0$
is always a local solution of (2) for the bridge penalty, which prevents any possibility for the uniqueness
of a reasonable local solution among sparse local solutions, a topic which we will investigate in this
paper. In order to address this issue, additional penalty functions $\rho(t; \lambda)$ with finite derivatives at
$t = 0$ have been suggested in the literature, such as the SCAD penalty [H], and the MCP penalty
[42]. These penalties can be written in a more general form as $\rho(t; \lambda) = \lambda^2\rho(t/\lambda)$ with $\rho(0) = 0$ and
$1 - t \le (d/dt)\rho(t) \le 1$ for $t > 0$, including the SCAD with $(d/dt)\rho(t) = 1 \wedge (1 - (t - 1)/(\gamma - 1))_+$,
$\gamma > 2$, and the MCP with $(d/dt)\rho(t) = 1 \wedge (1 - t/\gamma)_+$, $\gamma > 1$. It can be verified that the $\ell_\alpha$
penalty for $0 < \alpha < 1$, the SCAD and MCP are all concave. Another simple concave penalty is
$\rho(t; \lambda) = \min(\lambda^2\gamma/2, \lambda|t|)$, $\gamma > 1$, introduced in [35] as the capped-$\ell_1$ penalty.
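
The derivative formulas above translate directly into code. The following sketch evaluates $(d/dt)\rho(t; \lambda) = \lambda\,\rho'(t/\lambda)$ for $t \ge 0$, using the scaling $\rho(t; \lambda) = \lambda^2\rho(t/\lambda)$ stated in the text; the function names are hypothetical, and for the capped-$\ell_1$ penalty the right derivative is used at the kink.

```python
def scad_derivative(t, lam, gamma):
    """(d/dt) rho(t; lam) for SCAD, t >= 0, gamma > 2."""
    u = t / lam
    return lam * min(1.0, max(0.0, 1.0 - (u - 1.0) / (gamma - 1.0)))


def mcp_derivative(t, lam, gamma):
    """(d/dt) rho(t; lam) for MCP, t >= 0, gamma > 1."""
    u = t / lam
    return lam * min(1.0, max(0.0, 1.0 - u / gamma))


def capped_l1_derivative(t, lam, gamma):
    """Right derivative of min(lam^2 * gamma / 2, lam * t) for t >= 0."""
    return lam if t < lam * gamma / 2.0 else 0.0


# Evaluate on a small grid with lam = 1 and gamma = 3 (illustrative values).
grid = [0.0, 1.0, 1.5, 2.0, 3.0, 4.0]
scad_values = [scad_derivative(t, 1.0, 3.0) for t in grid]
mcp_values = [mcp_derivative(t, 1.0, 3.0) for t in grid]
capped_values = [capped_l1_derivative(t, 1.0, 3.0) for t in grid]
```

All three derivatives equal $\lambda$ at $t = 0$, as for the Lasso, and vanish once $t$ is large enough (here $t \ge \gamma\lambda$), so strong coefficients receive no shrinkage.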

The above mentioned nonconvex interpolations of $\ell_0$ and $\ell_1$ penalties typically gain smoothness
over the $\ell_0$ penalty and thus allow more computational options. Meanwhile, they may improve
variable selection accuracy and gain oracle properties by reducing the bias of the Lasso. A more direct
way to reduce the bias of the Lasso is via the adaptive Lasso procedure [39], which solves the following
weighted $\ell_1$ regularization problem for some $\alpha \in (0, 1)$:

$$\min_b \frac{1}{2n}\|y - Xb\|_2^2 + \lambda\sum_{j=1}^p |w_j|^{-\alpha}|b_j|,$$

where $w$ is an estimator of $\beta$ (for example, the solution of the standard unweighted Lasso with
regularization parameter $\lambda$). A low-dimensional analysis in [39] showed that the adaptive Lasso
solution can achieve the oracle property asymptotically. A high dimensional analysis of this procedure
was given in [18]. For variable selection consistency and oracle properties to hold, the adaptive
Lasso requires stronger conditions in terms of the minimum signal strength $\min_{j \in \mathrm{supp}(\beta)} |\beta_j|$
than what is optimal. Specifically, the optimal requirement is $\min_{j \in \mathrm{supp}(\beta)} |\beta_j| \ge \gamma\lambda_{univ}$ with
$\lambda_{univ} = \sigma\sqrt{(2/n)\ln p}$ for some constant $\gamma$ that may depend on an $\ell_2$ regularity condition (also
see Eq (11) below), which can be achieved by other procedures [42] [47]; however, the adaptive Lasso
requires $\min_{j \in \mathrm{supp}(\beta)} |\beta_j|$ to be significantly larger than the optimal order of $\lambda_{univ}$. This means
the adaptive Lasso is sub-optimal for sparse estimation problems. We also observe that the adaptive Lasso
does not directly minimize a concave loss function, and hence it is not an instance of (2). It
was later noted that this procedure is only one iteration of using the so-called MM (majorization-
minimization) principle to solve (2) with the bridge penalty (for example, see [50]). The corresponding
MM procedure is referred to as multi-stage convex relaxation in [45] [47]. For sparse estimation
problem (2) with a penalty $\rho(t; \lambda)$ that is concave in $|t|$, this method iteratively invokes the solution
of the following reweighted $\ell_1$ regularization problem for stage $\ell = 1, 2, \ldots$, starting with the initial
value of $\hat\beta^{(0)} = 0$:

$$\hat\beta^{(\ell)} = \arg\min_{b \in \mathbb{R}^p} \left\{\frac{1}{2n}\|Xb - y\|_2^2 + \sum_{j=1}^p \lambda_j|b_j|\right\},$$

where $\lambda_j = (d/dt)\rho(t; \lambda)|_{t = |\hat\beta_j^{(\ell-1)}|}$ $(j = 1, \ldots, p)$. This procedure may be regarded as a multi-stage
extension of the adaptive Lasso, which corresponds to the stage-2 solution $\hat\beta^{(2)}$ with the bridge penalty.
Unlike results for the adaptive Lasso, the results in [45] [37] for the multi-stage relaxation method allow
$\min_{j \in \mathrm{supp}(\beta)} |\beta_j|$ to achieve the optimal order of $\lambda_{univ}$, which match those of [32] and improve
upon [15]. Moreover, only $\ell = O(\ln(\|\beta\|_0))$ stages are necessary in order to achieve model selection
consistency and oracle properties. It is worth pointing out that the multi-stage procedure can also
be adapted to work with the Dantzig selector formulation [23].
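
The multi-stage relaxation described above can be sketched in a few lines. The code below is an illustrative sketch only: it assumes the MCP derivative for the stage weights (any penalty concave in $|t|$ with a finite derivative at zero could be substituted), solves each weighted $\ell_1$ problem by coordinate descent under the normalization $\|x_j\|_2 = \sqrt{n}$, and uses hypothetical function names and toy data.

```python
def soft_threshold(z, lam):
    """Soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def mcp_derivative(t, lam, gamma):
    """(d/dt) rho(t; lam) for MCP, t >= 0, gamma > 1."""
    return lam * min(1.0, max(0.0, 1.0 - t / (lam * gamma)))


def weighted_lasso_cd(X, y, weights, b_start, n_sweeps=200):
    """Minimize ||X b - y||_2^2 / (2 n) + sum_j weights[j] * |b_j|.

    Coordinate descent, assuming each column of X has squared norm n.
    """
    n, p = len(X), len(X[0])
    b = list(b_start)
    r = [y[i] - sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
    for _ in range(n_sweeps):
        for j in range(p):
            z = sum(X[i][j] * r[i] for i in range(n)) / n + b[j]
            new_bj = soft_threshold(z, weights[j])
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


def multi_stage(X, y, lam, gamma, n_stages=3):
    """Multi-stage convex relaxation started from b = 0."""
    b = [0.0] * len(X[0])
    for _ in range(n_stages):
        # stage weights: penalty derivative at the previous stage's solution
        weights = [mcp_derivative(abs(bj), lam, gamma) for bj in b]
        b = weighted_lasso_cd(X, y, weights, b)
    return b


# Toy orthogonal design (hypothetical): n = 4, p = 2, columns of norm sqrt(n).
X = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
y = [2.1, 1.9, 2.1, 1.9]
b_stage1 = multi_stage(X, y, lam=0.5, gamma=2.0, n_stages=1)
b_final = multi_stage(X, y, lam=0.5, gamma=2.0, n_stages=3)
```

With the initial value zero, the first stage coincides with the Lasso and returns the shrunken value 1.5 for the strong coefficient; from the second stage on its weight is zero, so the estimate moves to the least squares value 2.0 on the selected support while the weak coefficient stays at zero.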

For large $p$, the global solution of a nonconvex regularization method is hard to compute, so that
local solutions are often used instead. Therefore theoretical analysis of nonconvex regularization
has so far focused on specific numerical procedures that can find local solutions. For the $\ell_0$ penalty,
the penalized loss in (2) is typically evaluated for a subset of the $2^p$ possible models $\mathrm{supp}(b)$ such as
those generated in stepwise regression. For smooth concave penalties, iterative algorithms can be
used to find local minima of the penalized loss in (2) for a set of penalty levels [191 EH H3 HI [25] [37J .
For the MCP and other quadratic spline concave penalties, a path following algorithm can be used
to find local minima for an interval of penalty levels [32].

Advances have been made in the analysis of nonconvex regularization methods on multiple
fronts [HI [151 SSI [HI [36l [32l [35l [5] . For the concave penalized loss in (2), local minimizers exist with
the oracle property (3) under mild conditions [14] [15]. However, it remains unclear whether there
exist computationally efficient procedures that can find the local minimizers investigated in [14] [15].
For the MCP, the local minima generated by the path following algorithm control the estimation
error and model size in the sense of (8) under an $\ell_2$ regularity condition on $X$ [42]. Under the
additional condition

$$\min_{j \in S} |\beta_j| \ge \gamma\lambda_{univ} \ge \sup\{t : (d/dt)\rho(t; \lambda) \neq 0\} \qquad (11)$$

with $\lambda_{univ} = \sigma\sqrt{(2/n)\ln p}$ and a certain constant $\gamma > 1$, the same path following solution has
the oracle property (3) and thus the sign-consistency property [32]. Similar results hold for the
SCAD and certain other quadratic spline penalties [32]. Under (11) and $\ell_2$ regularity conditions
on $X$, the oracle property (3) and model selection consistency have also been established for a
specific forward/backward stepwise regression scheme [46] that can be regarded as an approximate
$\ell_0$ penalty minimization algorithm. As we have mentioned earlier, the multi-stage relaxation scheme
for minimizing (2) also leads to oracle inequality and model selection consistency under (11) and
$\ell_2$ regularity conditions on $X$ [45] [37].

While a number of specialized results were obtained for specialized numerical procedures under
appropriate conditions, it is not clear what the relationship among these solutions is. For example,
it is not clear whether the global solution of (2) is unique and whether it corresponds to solutions
of various numerical procedures studied in the literature. This leads to a conceptual gap in the
sense that it is not clear whether we should study specific local solutions as in the above mentioned
previous work or we should try to solve (2) as accurately as possible (with the hope of finding the
global solution). It is worth mentioning that related to this question, oracle inequalities involving
global solutions with nonconvex penalties have been studied in the literature (for example, see
related sections in [5]). However, such oracle inequalities do not lead to results comparable to those
of [421 [4"T] . Another relevant study is [20], which showed that in the lower dimensional scenario
with $p < n$, the global solution of (2) agrees with the oracle estimator $\hat\beta^o$ for the SCAD penalty
when $\min_{\beta_j \neq 0} |\beta_j|$ is sufficiently large, and some other appropriate assumptions hold. However,
their analysis does not directly generalize to the more complex high dimensional setting.

The purpose of the remainder of this paper is to present some general results showing that under
appropriate $\ell_2$-regularity conditions, the global solution of an appropriate nonconvex regularization
method leads to desirable recovery performance; moreover, under suitable conditions, the global
solution corresponds to the unique sparse local solution, which can be obtained via different numerical
procedures. This leads to a unified view of concave high dimensional sparse estimation methods
that can serve as a guideline to develop additional numerical algorithms for concave regularization.

3 High-Level Description of Main Results 

As we have discussed in our brief survey, concave regularized methods have been proven to control
the estimation error and the dimension of the selected model (8) under $\ell_2$ regularity conditions and
possess the oracle property (3) or the sign-consistency property under the additional assumption
(11). However, these results are established for specific local solutions of (2) with specific penalties.
For $p > n$ it is still unclear if the global minimizer in (2) is identical to these local solutions or
controls estimation and selection errors in a similar way. In this paper, we unify the aforementioned
results with the global solution of (2). Technical results are rigorously described in Section 4 below.
This section explains the main thrust of these results.

We are mainly interested in two situations: $\ell_0$ regularization where $\rho(t; \lambda)$ is discontinuous at
$t = 0$, and smooth regularization which is continuous for all $t \ge 0$ and piece-wise differentiable.
However, our basic results require only sub-additivity and monotonicity of $\rho(t; \lambda)$ in $t$ in $[0, \infty)$.

We shall first describe assumptions of our analysis in Subsection 4.1. As we have pointed
out, the key regularity conditions required in our analysis are expressed in terms of the sparse
eigenvalues in (4) or invertibility factors RIF and CIF defined in (14) and (5). For the sake of
clarity, we assume that these quantities are all constants, and this requirement is an $\ell_2$ regularity
condition. Another condition required by our analysis is called null-consistency, which requires
that if $\beta = 0$, then the global minimizer of (2) is achieved at $\hat\beta = 0$ (the actual condition, given
in Assumption 2, is slightly stronger). Clearly this condition depends both on the matrix $X$ and
on the noise vector $\varepsilon$. It is shown in Subsection 4.1 that under the standard sub-Gaussian noise
assumption (see Assumption 1), the null-consistency condition is $\ell_2$ regular.

In summary, all assumptions on $X$ needed in our analysis are $\ell_2$ regular; with this in mind, we
may examine the main results, which are divided into four subsections.

Subsection 4.2 is concerned with basic properties of the global optimal solution of (2) for all subadditive
nondecreasing penalties. Theorem 1 gives $\ell_q$-norm error bounds for $\|\hat\beta - \beta\|_q$ and a bound
of the prediction error $\|X\hat\beta - X\beta\|_2$ that are comparable with known results for $\ell_1$ regularization.
This means that under appropriate $\ell_2$ regularity conditions, the global solution of a concave
regularization problem is no worse than the Lasso solution in terms of the order of estimation
error. Theorem 2 shows that the global optimal solution of (2) is sparse, and under appropriate $\ell_2$
regularity conditions, the sparsity is of the same order as $\|\beta\|_0$; that is, $\|\hat\beta\|_0 = O(\|\beta\|_0)$. Thus,
(8) holds for the global solution of (2). Moreover, if the second order derivative of $\rho(t; \lambda)$ with
respect to $t$ is sufficiently small, then the global solution is also the unique sparse local solution of
(2). That is, if a vector $\tilde\beta$ is a local solution of (2) which is sparse, $\|\tilde\beta\|_0 = O(\|\beta\|_0)$, then $\tilde\beta$ is the
global solution of (2). None of these results require $\min_{\beta_j \neq 0} |\beta_j|$ to be bounded away from
zero. Furthermore, since these results require only $\ell_2$ regularity conditions, they apply to the case
of $p \gg n$ as long as $s^*(\ln p)/n$ is small.

Subsection 4.3 contains results specifically for $\ell_0$ regularization. First, the global solution of
$\ell_0$ regularization is sparse. Moreover, with sub-Gaussian noise, the prediction error bound for the $\ell_0$
penalty in Theorem 3 does not depend on properties of the design matrix $X$. This significantly
improves upon the corresponding result for general penalties in Theorem 1, which requires a nontrivial
$\mathrm{RIF}_1$ condition on the design matrix $X$. If the smallest sparse eigenvalue of $X^T X/n$ is
bounded from below, then we obtain in Theorem 4 the selection consistency for $\ell_0$ regularization
under (11), which implies the oracle property.

Subsection 4.4 considers penalties $\rho(t; \lambda)$ which are both left- and right-differentiable, for which
one can define (approximate) local solutions that are what numerical optimization procedures
compute. Theorem 5 considers the distance between two approximate local solutions. An immediate
consequence of the result says that under appropriate assumptions, if $(d/dt)\rho(t; \lambda) = 0$ when $t$ is
sufficiently large, then there is a unique sparse local solution of (2) that corresponds to the oracle
least squares solution $\hat\beta^o$ under (11). Therefore the unique local solution has the oracle property.
Moreover, this unique local solution has to be the global optimal solution according to Theorem 2.
While Theorem 5 shows that it is possible for a penalty that is not second order differentiable to have
a unique sparse local solution, it requires the condition (11) for such penalties. In contrast, with a
second order differentiable concave penalty, condition (11) is not needed in Theorem 5 for sparse
local solutions to be unique. This suggests an advantage of using smooth concave penalties, which
may lead to fewer local solutions under certain conditions. Theorem 6 gives sufficient conditions
under which the global optimal solution of (2) achieves model selection consistency. These sufficient
conditions generalize the irrepresentable condition (9) for the model selection consistency of the Lasso.
However, unlike the irrepresentable condition for the Lasso, which is not an $\ell_2$ regularity condition, for
a concave penalty where $(d/dt)\rho(t; \lambda)$ is small for sufficiently large $t$, the generalized irrepresentable
condition required in Theorem 6 automatically holds when $\min_{\beta_j \neq 0} |\beta_j|$ is not too small. Moreover,
for appropriate nonconvex penalties, it is possible to achieve a selection threshold of optimal order
as in (11).

Note that the results in Subsection 4.4 show that if one can find a local solution of (2) and the
solution is sparse, then under appropriate conditions, it is the global solution of (2) and it is close
to the oracle least squares solution $\hat\beta^{o}$. It is possible to design numerical procedures that find
a sparse local solution of (2). For such a procedure, the results of Subsection 4.4 directly apply.
Subsection 4.5 further develops this line of thinking. Theorem 7 shows that if a local solution
is also an approximate global solution, then it is sparse. This fact can be combined with the results
in Subsection 4.4 to imply that under appropriate conditions, this particular local solution is the
unique sparse local solution (which is also the global solution). Moreover, such a solution can be
obtained via Lasso followed by gradient descent, as it can be shown that the Lasso is a sufficiently
accurate approximate global solution of (2) for the result to apply.

Our results essentially imply the following: under appropriate $\ell_2$ regularity conditions, plus
appropriate assumptions on the penalty $\rho(t;\lambda)$, procedures considered earlier such as MCP [42]
or multi-stage convex relaxation [19], [50], [45] give the same local solution that is also the global
minimizer of (2). Moreover, other procedures (such as Lasso followed by gradient descent) can be
designed to obtain the same solution. Therefore these results present a coherent view of concave
regularization by unifying a number of earlier approaches and by extending a number of previous
results. This unified theory presents a more satisfactory treatment of concave high dimensional
sparse estimation procedures.

4 Technical Statements of the Main Results 

This section describes in detail our new technical results characterizing the global and local optimal 
solutions of ([2]) under different regularization conditions. Before going into the main results, we 
will specify some assumptions and definitions required in our analysis. 

4.1 General Assumptions and Definitions 

In this subsection, we describe and discuss general conditions imposed in the rest of the paper. 

We first consider conditions on the regularizer $\rho(t;\lambda)$. We assume throughout the sequel the
following conditions on the penalty function:

(i) $\rho(0;\lambda) = 0$;

(ii) $\rho(-t;\lambda) = \rho(t;\lambda)$;

(iii) $\rho(t;\lambda)$ is non-decreasing in $t$ in $[0,\infty)$;

(iv) $\rho(t;\lambda)$ is subadditive with respect to $t$: $\rho(x+y;\lambda) \le \rho(x;\lambda) + \rho(y;\lambda)$ for all $x, y \ge 0$.

This family of penalties is closed under the summation and maximization operations and includes
all functions increasing and concave in $|t|$. Although we are mainly interested in the case where
$\rho(t;\lambda)$ is concave in $|t|$, all of our results hold under the above specified weaker conditions, sometimes
with side conditions such as the monotonicity of $\rho(t;\lambda)/t$ for $t > 0$ and the continuity of $\rho(t;\lambda)$ at
$t = 0$. We will mention explicitly when such side conditions are needed.

We are particularly interested in the $\ell_0$ regularization $\rho(t;\lambda) = (\lambda^2/2)I(t \neq 0)$, which is
discontinuous at $t = 0$. In addition, we are interested in regularizers $\rho(t;\lambda)$ that are continuous in $t \ge 0$ and
piecewise differentiable. With such regularizers, local solutions of (2) can be defined as solutions
with gradient zero. A local solution can be obtained using standard numerical procedures such as
gradient descent.

Given a regularizer $\rho(t;\lambda)$ and any fixed $\lambda > 0$, we define the threshold level of the penalty as

$$\lambda^* := \inf_{t>0}\{t/2 + \rho(t;\lambda)/t\}. \tag{12}$$

The quantity $\lambda^*$ is a function of $\lambda$ that provides a natural normalization of $\lambda$. We call $\lambda^*$ the
threshold level since $\arg\min_t\{(z-t)^2/2 + \rho(t;\lambda)\} = 0$ iff $|z| \le \lambda^*$. This can be easily seen from
$(z-t)^2/2 + \rho(t;\lambda) - z^2/2 = t\{t/2 + \rho(t;\lambda)/t - z\}$. If $\rho(t;\lambda)$ is continuous at $t = 0$ and concave
in $t \in (0,\infty)$, then $\lambda^* \le \lim_{t\to 0+}(d/dt)\rho(t;\lambda)$. For simplicity, we may also require that $\rho(t;\lambda)$ be
chosen such that $\lambda^* = \lambda$, which holds for the penalties discussed in Subsection 2.2 such as the $\ell_0$,
bridge, SCAD, MCP, and capped-$\ell_1$ penalties.

In the following and in the proofs, we will use the short-hand notation

$$\|\rho(b;\lambda)\|_1 = \sum_{j=1}^{p}\rho(b_j;\lambda), \quad \forall\, b = (b_1,\ldots,b_p)^T \in \mathbb{R}^p.$$
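
As a numerical illustration of the threshold level in (12), the following sketch evaluates $\lambda^*$ on a grid and checks the thresholding property for the $\ell_0$ and capped-$\ell_1$ penalties. The grid resolution, the choice $\gamma = 3$, and all function names are assumptions made for this illustration only; a grid search approximates the infimum and is not an exact computation.

```python
import numpy as np

def rho_l0(t, lam):
    # l0 penalty (lam^2 / 2) * I(t != 0)
    return 0.5 * lam**2 * (np.abs(t) > 0)

def rho_capped_l1(t, lam, gamma=3.0):
    # capped-l1 penalty min(gamma * lam^2 / 2, lam * |t|); gamma = 3 is an assumed value
    return np.minimum(0.5 * gamma * lam**2, lam * np.abs(t))

def threshold_level(rho, lam, t_max=5.0, num=5000):
    # Grid approximation of lambda* = inf_{t > 0} { t/2 + rho(t; lam)/t }
    t = np.linspace(t_max / num, t_max, num)
    return float(np.min(t / 2.0 + rho(t, lam) / t))

def scalar_argmin(rho, z, lam, t_max=5.0, num=5001):
    # Grid minimizer of (z - t)^2 / 2 + rho(t; lam) over t >= 0 (for z >= 0)
    t = np.linspace(0.0, t_max, num)
    obj = 0.5 * (z - t) ** 2 + rho(t, lam)
    return float(t[np.argmin(obj)])
```

With `lam = 1.0`, both penalties give a grid value of `threshold_level` close to 1, and `scalar_argmin` returns 0 for `z = 0.9` but a nonzero value for `z = 1.1`, in line with the characterization of $\lambda^*$ above.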
Definition 2. The following quantity bounds a general penalty via the $\ell_1$ penalty for sparse vectors:

$$\Delta(a,k;\lambda) = \sup\{\|\rho(b;\lambda)\|_1 : \|b\|_1 \le ak,\ \|b\|_0 = k\}. \tag{13}$$

Proposition 1. Let $\rho^*(t;\zeta) = \zeta|t| + (\zeta - |t|/2)_+^2/2$. Let $\lambda^*$ be as in (12). Then,

$$\min\{\lambda^*|t|/2,\ (\lambda^*)^2/2\} \le \rho(t;\lambda) \le \rho^*(t;\lambda^*),$$
$$\Delta(a,k;\lambda) \le k\rho^*(a;\lambda^*) \le k\max(a, 2\lambda^*)\lambda^*.$$

Remark 3. It follows from Proposition 1 that given a threshold level $\lambda^*$, all penalty functions
satisfying general conditions (i)-(iv) are bounded by a capped-$\ell_1$ penalty from below and the maximum
of the $\ell_0$ and $\ell_1$ penalties from above, up to a factor of 2. The function $\rho^*(t;\zeta)$ is a convex quadratic
spline fit of $\max(\zeta^2/2, \zeta|t|)$, the maximum of the $\ell_0$ and $\ell_1$ penalties with threshold level $\zeta$.

Remark 4. A trivial upper bound is $\Delta(a,k;\lambda) \le k\max_t\rho(t;\lambda)$, which is useful only for bounded
penalties. The bound $\rho(t;\lambda) \le \gamma^*\lambda^2$ holds with $\gamma^* = 1/2$ for the $\ell_0$ penalty, $\gamma^* = \gamma/2$ for the
capped-$\ell_1$ penalty and MCP, and $\gamma^* = (1+\gamma)/2$ for the SCAD penalty. If $\rho(t;\lambda)$ is concave in
$t \in [0,\infty)$, then $\Delta(a,k;\lambda) \le k\rho(a;\lambda)$ by the Jensen inequality. For $a \ge 2\lambda^*$, $\Delta(a,k;\lambda) \le a\lambda^*k$
matches the trivial bound for the $\ell_1$ penalty, for which $\lambda = \lambda^*$.
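
The two-sided bound of Proposition 1 can be checked numerically for specific penalties. The sketch below does this on a grid for the MCP and SCAD penalties, using their closed forms and assuming $\lambda^* = \lambda$ (which holds for these penalties when $\gamma \ge 1$). The values $\gamma = 3$ and $\gamma = 3.7$, the grid, and the function names are illustrative assumptions.

```python
import numpy as np

def rho_mcp(t, lam, gamma=3.0):
    # MCP: lam * int_0^{|t|} (1 - x / (gamma * lam))_+ dx, in closed form
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a**2 / (2.0 * gamma), 0.5 * gamma * lam**2)

def rho_scad(t, lam, gamma=3.7):
    # SCAD: lam * int_0^{|t|} min{1, (1 - (x/lam - 1)/(gamma - 1))_+} dx, in closed form
    a = np.abs(t)
    mid = (2.0 * gamma * lam * a - a**2 - lam**2) / (2.0 * (gamma - 1.0))
    return np.where(a <= lam, lam * a,
                    np.where(a <= gamma * lam, mid, 0.5 * (gamma + 1.0) * lam**2))

def rho_star(t, zeta):
    # Upper envelope rho*(t; zeta) = zeta * |t| + (zeta - |t|/2)_+^2 / 2
    a = np.abs(t)
    return zeta * a + 0.5 * np.maximum(zeta - a / 2.0, 0.0) ** 2

def lower_bound(t, lam_star):
    # Lower envelope min{lam* |t| / 2, (lam*)^2 / 2}
    return np.minimum(lam_star * np.abs(t) / 2.0, lam_star**2 / 2.0)

def check_proposition_1(rho, lam, t_max=20.0, num=20001, tol=1e-12):
    # Grid check of lower_bound <= rho <= rho_star, taking lam* = lam (assumption)
    t = np.linspace(0.0, t_max, num)
    vals = rho(t, lam)
    ok_lower = np.all(lower_bound(t, lam) <= vals + tol)
    ok_upper = np.all(vals <= rho_star(t, lam) + tol)
    return bool(ok_lower and ok_upper)
```

A grid check of this kind cannot prove the proposition; it only confirms that the stated envelopes are consistent with these two penalties at the sampled points.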

Next, we consider conditions on the design matrix $X$. Recall that $X$ is column normalized to

$$\|x_j\|_2^2 = n$$

for simplicity. Our analysis also depends on the sparse eigenvalues defined earlier and
the restricted invertibility factor defined as follows.

Definition 3. For $q \ge 1$, $\xi > 0$ and $S \subset \{1,\ldots,p\}$, we define the restricted invertibility factor as

$$\mathrm{RIF}_q(\xi,S) = \inf\left\{\frac{|S|^{1/q}\|X^TXu\|_\infty}{n\|u\|_q} : \|\rho(u_{S^c};\lambda)\|_1 \le \xi\|\rho(u_S;\lambda)\|_1\right\}. \tag{14}$$

The restricted invertibility factor is the quantity needed to separate conditions on $X$ and $\varepsilon$ in
our analysis. For $1 \le q \le 2$, sparse eigenvalues can be used to find lower bounds of $\mathrm{RIF}_q(\xi,S)$.

Proposition 2. Let CIF be as in |3]). Ift/p(t;X) is increasing in t £ (0, 00), then 

RIF 9 (C,5)> inf CIFgfoA). (15) 

— I fir I 

Remark 5. For the $\ell_1$ penalty, $\mathrm{RIF}_q = \mathrm{CIF}_q$. If $\rho(t;\lambda)$ is concave in $t \in [0,\infty)$, then $t/\rho(t;\lambda)$ is
increasing in $t$. Thus, Proposition 2 is applicable to all penalty functions discussed in Subsection 2.2,
including the $\ell_0$, bridge, SCAD, MCP, and capped-$\ell_1$ penalties.

Remark 6. The CIF can be uniformly bounded from below in terms of sparse eigenvalues: 

rw ( t / i 1 < g < 2 H"-(fc + i) - (g/2)(fe/£) 1 / 2 K + (fc + 51)} 

j - (i + i)v^{i + ek/{u)Y~yi{i + £/ky^ ' 1 j 

for all 1 < £ < (p—\S\)/5 by Proposition 5 and (21) in where k = \S\, and K-(m) and K+(m) 
are as in For example, if we take £ = 2 and £ = 2k and q = 2, then 

CIF 2 (£,S) > {K„(3k) - K + (llk)/V2} /V4~5. 
Remark 7. It follows from Proposition 2 and Remark 6 that conditions $\mathrm{RIF}_q(\xi,S) > 0$ and
$1/\mathrm{RIF}_q(\xi,S) = O(1)$ are both $\ell_2$-regularity conditions on $X$ for $1 \le q \le 2$. Moreover, $\mathrm{rank}(X) = p$
implies $\mathrm{RIF}(\xi,S) > 0$. To check the $\ell_2$ regularity of these conditions, we suppose that the rows of
$X$ are iid from $N(0,\Sigma)$ with all eigenvalues of $\Sigma$ in $[c_1,c_2] \subset (0,\infty)$. Then, $c_1/2 \le \kappa_-(m)$ and
$\kappa_+(m) \le 2c_2$ with at least probability $1 - \delta \in [0,1)$ for $m \le c_3n/\ln(p/\delta)$ for a certain $c_3 > 0$. Let
$c_4 = \{c_1/(\xi c_2)\}^2$. In this event, setting $k = s^*$ and $\ell = (m - s^*)/5$ in (16) yields

$$\min_{|S| \le s^*}\mathrm{RIF}_2(\xi,S) \ge \min_{|S| \le s^*}\mathrm{CIF}_2(\xi,S) \ge (c_1/4)\big/\sqrt{(1 + \xi^2c_4/4)(1 + 1/c_4)}$$

when $5s^*/(m - s^*) \le c_4$ for some $m \le c_3n/\ln(p/\delta)$, which holds when $(s^*/n)\ln(p/\delta) \le c_3/(1 + 5/c_4)$.

Finally, we consider conditions on the error vector.

Assumption 1. An error vector $\varepsilon$ is sub-Gaussian with noise level $\sigma$ if for all $t \ge 0$:

$$P(|u^T\varepsilon| > \sigma t) \le \exp(-t^2/2)$$

for all vectors $u$ with $\|u\|_2 = 1$ and

$$P\big(\|P_A\varepsilon\|_2/|A|^{1/2} > \sigma(1+t)\big) \le \exp(-|A|t^2/2)$$

for all subsets $A \subset \{1,\ldots,p\}$, where $P_A$ is the orthogonal projection to the range of $X_A$ (that is,
$P_A = X_AX_A^{+}$, where $X_A^{+}$ is the Moore-Penrose generalized inverse of $X_A$).

The above sub-Gaussian condition holds with $\varepsilon \sim N(0,\sigma^2I_{n\times n})$. It is equivalent to the more
common version of the sub-Gaussian condition $E\,e^{v^T\varepsilon/\sigma'} \le e^{\|v\|_2^2/2}$ for all vectors $v$ and a constant
$\sigma'$ of the same order as $\sigma$. As we have mentioned in Section 3, what we really need is a
null-consistency condition, which we give below. The sub-Gaussian condition will be used to verify the
null-consistency condition.

Assumption 2. Let $\eta \in (0,1]$. We say that the regularization method satisfies the $\eta$
null-consistency condition if the following equality holds:

$$\min_{b \in \mathbb{R}^p}\Big(\|\varepsilon/\eta - Xb\|_2^2/(2n) + \|\rho(b;\lambda)\|_1\Big) = \|\varepsilon/\eta\|_2^2/(2n). \tag{17}$$

Remark 8. Given $\eta = 1$, the null-consistency condition means that if $\beta = 0$, then the global
minimizer of (2) is achievable at $\hat\beta = 0$. This requirement is clearly necessary for the global
minimizer of (2) to satisfy the error bound (19) in Theorem 1 below for $|S| = 0$. Here, we also
allow a slightly stronger condition with $\eta < 1$, which requires $\hat\beta = 0$ for $\beta = 0$ when the noise $\varepsilon$ is
proportionally inflated by $1/\eta$.

Proposition 3. Suppose that e is sub-Gaussian with noise level a , < 5 < 1 and £o > 0. Suppose 
p(t;X) > ((A*) 2 /2) A (A*|*|) with A* > (1 + ( )(a/r j )n- 1 ^(l + y / 2ln(2p/S)) . Then, satisfies the 
rj null- consistency condition with at least probability 2 — e* 5 / 2 — exp(— n(l — l/\/2) 2 ), provided that 

J 1/2 ( T B n A = 0, \A\ = r ank(P A ) = \B\ = k\ 

max|A m / ax (A B P A A B /n): ^ + ^ + < 2n P C °" (18) 

Moreover, I118\) holds with no smaller probability than 1 — 5 4 /(Wp 2 ) if the rows of X are iid from 
iV(0, XI) and \/8\m ax (H) < Co(l + Co)- This means that under the sub-Gaussian condition on e, the 
rj null- consistency is an l 2 -regularity condition. 
Remark 9. The condition $\rho(t;\lambda) \ge \min(\lambda^2/2, \lambda|t|)$ holds for the $\ell_0$, $\ell_1$, SCAD, and capped-$\ell_1$
penalties, so that Proposition 3 is directly applicable with $\lambda = \lambda^*$. In general, the condition of
Proposition 3 holds for all penalties considered in this paper when the threshold level in (12) satisfies
$\lambda^* \ge 2(1+\xi_0)(\sigma/\eta)n^{-1/2}\big(1 + \sqrt{2\ln(2p/\delta)}\big)$, in view of the lower bound of $\rho(t;\lambda)$ in Proposition 1.
For the $\ell_0$ and $\ell_1$ penalties, we may set $\xi_0 = 0$ in Proposition 3 (the extra condition (18) is not
necessary). The simplified condition for the $\ell_0$ penalty is explicitly given in Theorem 3. For the $\ell_1$
penalty, the $\eta$ null-consistency condition is equivalent to $\|X^T\varepsilon\|_\infty \le \eta\lambda n$.
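
For the $\ell_1$ penalty, the equivalent form $\|X^T\varepsilon\|_\infty \le \eta\lambda n$ is easy to check by simulation. The following sketch draws Gaussian noise, takes $\lambda = (\sigma/\eta)n^{-1/2}(1 + \sqrt{2\ln(2p/\delta)})$ (the level of Proposition 3 with $\xi_0 = 0$), and reports how often the condition holds. The Gaussian design, the dimensions, the seed, and the function names are assumptions of this illustration.

```python
import numpy as np

def universal_threshold(n, p, sigma, eta, delta):
    # lam = (sigma / eta) * n^{-1/2} * (1 + sqrt(2 ln(2p / delta)))
    return (sigma / eta) * (1.0 + np.sqrt(2.0 * np.log(2.0 * p / delta))) / np.sqrt(n)

def l1_null_consistent(X, eps, lam, eta):
    # Equivalent form of eta null-consistency for the l1 penalty
    n = X.shape[0]
    return bool(np.max(np.abs(X.T @ eps)) <= eta * lam * n)

def success_rate(n=100, p=50, sigma=1.0, eta=0.5, delta=0.1, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)  # column normalization ||x_j||_2^2 = n
    lam = universal_threshold(n, p, sigma, eta, delta)
    hits = sum(
        l1_null_consistent(X, sigma * rng.standard_normal(n), lam, eta)
        for _ in range(trials)
    )
    return hits / trials
```

Calling `success_rate()` returns the empirical frequency with which the condition holds over the simulated noise draws; under these settings it should be well above $1 - \delta$.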

4.2 Basic Properties of the Global Solution 

We now turn our attention to the global solution of (2) with a general subadditive nondecreasing
regularizer $\rho(t;\lambda)$. We first consider the estimation of $X\beta$ and $\beta$.

Theorem 1. Let $S = \mathrm{supp}(\beta)$, $\hat\beta$ be as in (2), $\lambda^*$ as in (12), and $\mathrm{RIF}_q(\xi,S)$ as in (14). Consider
$\eta \in (0,1)$ and $\xi = (\eta+1)/(1-\eta)$, and assume that (17) holds. Then for all $q \ge 1$:

$$\|\hat\beta - \beta\|_q \le (1+\eta)\lambda^*|S|^{1/q}/\mathrm{RIF}_q(\xi,S), \tag{19}$$

and with $a_1 = (1+\eta)/\mathrm{RIF}_1(\xi,S)$ and $\Delta(a,k;\lambda)$ in (13),

$$\|X\hat\beta - X\beta\|_2^2/n \le 2\xi\Delta(a_1\lambda^*,|S|;\lambda) \le 2\xi(a_1 \vee 2)(\lambda^*)^2|S|. \tag{20}$$

By using the bound $\Delta(a_1\lambda^*,|S|;\lambda) \le |S|\max_t\rho(t;\lambda)$, we obtain the following corollary.

Corollary 1. Consider penalties $\rho(t;\lambda)$ indexed by the threshold level, $\lambda^* = \lambda$ in (12). Suppose
that the $\eta$ null-consistency condition (17) holds. Let $S = \mathrm{supp}(\beta)$ and $\gamma^* = \max_t\rho(t;\lambda)/\lambda^2$. Then,

$$\|X\hat\beta - X\beta\|_2^2/n \le 2\{(1+\eta)/(1-\eta)\}\gamma^*\lambda^2|S|.$$

In particular, $\gamma^* = 1/2$ for the $\ell_0$ penalty $\rho(t;\lambda) = (\lambda^2/2)I(t \neq 0)$, $\gamma^* = \gamma/2$ for the capped-$\ell_1$
penalty $\rho(t;\lambda) = (\lambda^2\gamma/2) \wedge (\lambda|t|)$ and the MCP $\rho(t;\lambda) = \lambda\int_0^{|t|}(1 - x/(\lambda\gamma))_+dx$, and $\gamma^* = (1+\gamma)/2$
for the SCAD penalty $\rho(t;\lambda) = \lambda\int_0^{|t|}\min\{1, (1 - (x/\lambda - 1)/(\gamma - 1))_+\}dx$.
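
To make the constants in Corollary 1 concrete, the short sketch below evaluates the prediction error bound $2\{(1+\eta)/(1-\eta)\}\gamma^*\lambda^2|S|$ for the penalties listed above. The function names and the string labels for the penalties are hypothetical conveniences, not notation from the paper.

```python
def gamma_star(penalty, gamma=None):
    # gamma* = max_t rho(t; lam) / lam^2 for the penalties listed in Corollary 1
    if penalty == "l0":
        return 0.5
    if penalty in ("capped_l1", "mcp"):
        return gamma / 2.0
    if penalty == "scad":
        return (1.0 + gamma) / 2.0
    raise ValueError("unknown penalty label: " + str(penalty))

def prediction_bound(lam, support_size, eta, g_star):
    # Right-hand side of the bound on ||X beta_hat - X beta||_2^2 / n
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie in (0, 1)")
    return 2.0 * (1.0 + eta) / (1.0 - eta) * g_star * lam**2 * support_size
```

For example, with $\lambda = 1$, $|S| = 5$ and $\eta = 1/2$, the $\ell_0$ penalty gives $2 \cdot 3 \cdot (1/2) \cdot 5 = 15$.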

Remark 10. It is worthwhile to note that the prediction error bound in Corollary 1 does not depend
on $X$, provided that the penalty is large enough to guarantee null consistency. For the $\ell_0$ penalty, the
null consistency requires only $\|x_j\|_2 = \sqrt{n}$ on $X$, which we assume anyway. For the other concave
penalties in Corollary 1, we are only able to provide null consistency in Proposition 3 under a mild
condition on the upper eigenvalue of $X_B^TP_AX_B/n$, but not on the sparse lower eigenvalue of the
Gram matrix.

Next we provide an upper bound for the sparseness of $\hat\beta$ based on Theorem 1 and the maximum
sparse eigenvalue $\kappa_+(m)$. We denote by $\dot\rho(t;\lambda) = (d/dt)\rho(t;\lambda)$ any value between the left- and
right-derivatives of $\rho(\cdot;\lambda)$ and assume the left- and right-differentiability of $\rho(\cdot;\lambda)$ whenever the
notation $\dot\rho(t;\lambda)$ is invoked. For example, if $\rho(t;\lambda) = \lambda|t|$, then $\dot\rho(0\pm;\lambda) = \pm\lambda$ and $\dot\rho(0;\lambda)$ can be
any value in $[-\lambda,\lambda]$ (which, in all of our results, can be chosen as the most favorable value unless
explicitly mentioned otherwise).
Theorem 2. Let $\{S, \hat\beta, \lambda^*, \eta, \xi, a_1\}$ and $\Delta(a,k;\lambda)$ be as in Theorem 1 and $\hat S = \mathrm{supp}(\hat\beta)$. Suppose
that (17) holds. Consider $t_0 \ge 0$ and integer $m_0 \ge 0$ satisfying $m_0 = 0$ for $t_0 = 0$ and

$$\sqrt{2\xi\kappa_+(m_0)\Delta(a_1\lambda^*,|S|;\lambda)/m_0} + \|X^T\varepsilon/n\|_\infty < \inf_{0<s<t_0}\dot\rho(s;\lambda) \tag{21}$$

for $t_0 > 0$. Then,

$$|\hat S \setminus S| < \tilde m := m_0 + \lfloor\xi\Delta(a_1\lambda^*,|S|;\lambda)/\rho(t_0;\lambda)\rfloor. \tag{22}$$

The $\eta$ null consistency implies $\|X^T\varepsilon/n\|_\infty \le \eta\lambda^*$ by Lemma 1 in Section 5. If $\rho(t;\lambda)$ is concave
in $t > 0$, then the right-hand side of (21) can be replaced by $\dot\rho(t_0;\lambda)$, and $\rho(t_0;\lambda) \ge t_0\dot\rho(t_0;\lambda)$. These
facts give the following corollary for bounded and $\ell_1$ penalties.

Corollary 2. (i) Let $\rho(t;\lambda)$ and $\gamma^*$ be as in Corollary 1. Suppose (2) is $\eta$ null consistent in the
sense of (17) and $\dot\rho(a_0\lambda;\lambda) \ge \lambda(1 - a_1/\gamma)$ for some $a_0 > 0$ and $a_1 \ge 0$. If $m_0 = a|S|$ is an integer
and $2\gamma^*\kappa_+(a|S|)/a < (1 - a_1/\gamma - \eta)^2(1-\eta)/(1+\eta)$, then

$$|\hat S \setminus S| < \tilde m := \Big(a + \frac{\gamma^*/a_0}{1 - a_1/\gamma}\Big)|S|. \tag{23}$$

fiij Lei S 1 ^ 1 ^ = supp{P^ 1 ) with the Lasso ^ and CIF q as in (fSJ). In i/ie event ||X T e/n|| 00 < 77A, 

2« + (a|5|)/a (I- ») 3 ic(4)\ C l^ 1 01 Ioa\ 

CIF.ai + nVd-n)^) < (TT^F ^ 15 l] \ S \ <m -= ^ ™ 

Remark 11. Theorem^ and Corollary [D imply that the global solution (3 in |lj) is sparse under 
appropriate assumptions. For to regularization, we may take mo = to = with the convention 
K_l_(0)/0 = in \21\) . The Lasso also satisfies the dimension bound \S\S\ < m V 1 under the 
SRC: {K + (m + \S\) / K-{m + \S\) — l}/(2 — 2ao) < m/\S\ with an ao G (0, 1), provided that A > 
(1 + o(l)){Ky 2 (m) /ao}<r-y/ (2/n) hip An advantage of \2J$ is to allow an A not dependent on 
the upper sparse eigenvalue of the design for sub-Gaussian e. 

Remark 12. Let k* = sup 0<s< i{/o(t; A) — p(s; A)}/(s — t) be the maximum concavity of the penalty. 
Suppose K—(\S\ + m + fh — 2) > k*. Then, the penalized loss L\{b) in |lj) is convex in all models 
supp(b) = A with \A\S\ < m + fh — 2. This condition has been called sparse convexity \4^ . If m 
is as in $22\) or l2~S\) and (3 is a local solution of (0) with jf{j S : (3j ^ 0} < fh, then the local 
solution must be identical to the global solution. 

Remark 13. Consider penalties with A* = A which holds for all penalties discussed in Subsec- 
tion \2.2[ Let n £ (0, 1) and A* > be fixed. Suppose Theorem [1 or Corollary [H is applicable 
with m < a*\S\ for a fixed constant a* and all A > A*. Suppose in addition p(t;X) is contin- 
uous ml/A € [0, 1/A*] uniformly in bounded sets oft. Under the sparse convexity condition 
K—(\S\ + m — 1) > > 0, with the maximum concavity k* in Remark \12\ the global solution forms 
a continuous path in H p as a function of 1/ A > 1 / A* . This path is identical to the output of the 
path following algorithm in if it starts with j3 = at 1/A = 0. We will show in Theorem^ that 
gradient algorithms beginning from the Lasso may also yield the global solution under the sparse 
convexity condition. 
As a simple working example to illustrate Corollaries 1 and 2, we consider the capped-$\ell_1$ penalty
explicitly given in Corollary 1. Let $a_0 = \gamma/2$ in Corollary 2. We find

$$\|X\hat\beta - X\beta\|_2^2/n \le \lambda^2|S|\gamma(1+\eta)/(1-\eta),$$
$$\gamma\kappa_+(a|S|) < a(1-\eta)^3/(1+\eta) \ \Rightarrow\ |\hat S \setminus S| < (a+1)|S|.$$

The MCP, also explicitly given in Corollary 1, provides the same prediction bound and

$$\gamma\kappa_+(a|S|) < a(2/3 - \eta)^2(1-\eta)/(1+\eta) \ \Rightarrow\ |\hat S \setminus S| < (a + 9/4)|S|$$

by the same calculation with $a_0 = \gamma/3$. Note that generally speaking, unless stronger conditions
are imposed, Theorem 2 only implies that $|\hat S \setminus S| = O(|S|)$ but not $|\hat S \setminus S| = 0$, which is required for model
selection consistency. Model selection consistency will be studied later in the paper.

4.3 The Global Solution of $\ell_0$ Regularization

This subsection considers the global optimal solution $\hat\beta^{(\ell_0)}$ of $\ell_0$ regularization in (10). Our first
result says that under appropriate conditions, this solution is sparse.

Theorem 3. If for all $b \in \mathbb{R}^p$: $\varepsilon^TXb \le \lambda\eta n^{1/2}\|b\|_0^{1/2}\|Xb\|_2$ for some $\eta < 1$, then (10) satisfies the
$\eta$ null-consistency condition. It implies that the global optimal solution of (10) satisfies

1 — n z 1 — rj 

We also have the following result about model selection quality for £q regularization. 

Theorem 4. Assume that the assumption of Theorem^ holds. Let s = 2||/3||o/(l — rj 2 ) and f3 be 
as in Suppose \\X T (P$£ — £)\\oo/ n < a/2k_(s)A, where Ps is the orthogonal projection to the 

range of X$- Let S = supp((3), 5° = #{j £ S : |/3°| < Ay / 2/k_(s)}, and S = supp(f3^ °^). Then, 

\S-S\+0.5\S-S\ < 25°, \\X0 {£o) — 3°) [|| < 2X 2 S°. 

If the error e is sub-Gaussian in the sense of Assumption [H then the condition of Theorems [3] 
and [5] holds with at least probability 2 — e s for A > (cr/r/)(l + y / 21n(p/5))/y / n. Theorem [5] implies 
that model selection consistency can be achieved if the condition min^ggupp^) > A/a/k_(s) 
holds, which implies that 5° = 0. 

4.4 Approximate Local Solutions 

We have shown in Theorem 2 that under appropriate conditions, the global solution of (2) is sparse.
If $\rho(t;\lambda)$ is both left- and right-differentiable, one can define the concept of local solution as follows.
Given an excess $\nu \ge 0$, a vector $\tilde\beta \in \mathbb{R}^p$ is an approximate local solution of (2) if

$$\|X^T(X\tilde\beta - y)/n + \dot\rho(\tilde\beta;\lambda)\|_2^2 \le \nu. \tag{25}$$

This $\tilde\beta$ is a local solution if $\nu = 0$. Note that by convention, $\dot\rho(t;\lambda)$ can be chosen to be any value
between $\dot\rho(t-;\lambda)$ and $\dot\rho(t+;\lambda)$ to satisfy the equation. In this subsection, we provide estimates
of distances between approximate local solutions and use them to prove the equality of oracle
approximate local and global solutions of (2). This gives the selection consistency of the global
solution studied in Subsection 4.2. The oracle LSE is considered as an approximate local solution.
In addition, we define a sufficient condition for the existence of a sign consistent local solution which
generalizes the irrepresentable condition for Lasso selection and becomes an $\ell_2$ regularity condition
on $X$ for a broad class of concave penalties.

We first provide estimates of distances between approximate local solutions. We use the
following function $\theta(t,\kappa)$ to measure the degree of nonconvexity of a regularizer $\rho(t;\lambda)$ at $t \in \mathbb{R}$.

Definition 4. For $\kappa \ge 0$ and $t \in \mathbb{R}$, define

$$\theta(t,\kappa) := \sup_s\{-\mathrm{sgn}(s-t)(\dot\rho(s;\lambda) - \dot\rho(t;\lambda)) - \kappa|s-t|\}.$$

Moreover, given $u = (u_1,\ldots,u_p)^T \in \mathbb{R}^p$, we let $\theta(u,\kappa) = [\theta(u_1,\kappa),\ldots,\theta(u_p,\kappa)]$.

We are mostly interested in values of $\theta(t,\kappa)$ that achieve zero. We note that $\theta(t,\kappa) = 0$ for
convex $\rho(t;\lambda)$ with $\kappa \ge 0$. More generally, let $\kappa^*$ be the maximum concavity as in Remark 12.
Then, $\theta(t,\kappa) = 0$ for all $t$ iff $\kappa \ge \kappa^*$. For $\dot\rho(t+;\lambda) < \dot\rho(t-;\lambda)$, $\theta(t,\kappa) > 0$ for all finite $\kappa$. However,
we only need $\theta(t,\kappa) = 0$ for a proper set of $t$ in our selection consistency theory. As an example,
for $\kappa = 2/\gamma$, the capped-$\ell_1$ penalty $\rho(t;\lambda) = \min(\gamma\lambda^2/2, \lambda|t|)$ gives $\theta(t,\kappa) = 0$ when either $t = 0\pm$
or $|t| \ge \gamma\lambda$.
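
The capped-$\ell_1$ example can be reproduced numerically. The sketch below approximates the supremum defining $\theta(t,\kappa)$ on a grid of $s$, excluding the kinks of the penalty where the derivative is set-valued. The one-sided derivative at $t$ is supplied by the caller (for $t = 0+$ it is $\lambda$); the grid, the parameter values in the usage note, and the function names are assumptions of this illustration.

```python
import numpy as np

def capped_l1_derivative(s, lam, gamma):
    # Derivative of min(gamma * lam^2 / 2, lam * |s|) away from its kinks
    s = np.asarray(s, dtype=float)
    return np.where(np.abs(s) < gamma * lam / 2.0, lam * np.sign(s), 0.0)

def theta(t, rho_dot_t, kappa, lam, gamma, s_max=10.0, num=200001):
    # Grid approximation of sup_s { -sgn(s - t)(rho_dot(s) - rho_dot(t)) - kappa |s - t| }
    s = np.linspace(-s_max, s_max, num)
    kinks = np.array([-gamma * lam / 2.0, 0.0, gamma * lam / 2.0])
    s = s[np.min(np.abs(s[:, None] - kinks[None, :]), axis=1) > 1e-9]
    values = (-np.sign(s - t) * (capped_l1_derivative(s, lam, gamma) - rho_dot_t)
              - kappa * np.abs(s - t))
    # s = t always contributes the value 0, so the supremum is never negative
    return max(0.0, float(values.max()))
```

With $\lambda = 1$, $\gamma = 3$ and $\kappa = 2/\gamma$, `theta(0.0, 1.0, 2/3, 1.0, 3.0)` and `theta(4.5, 0.0, 2/3, 1.0, 3.0)` both evaluate to zero, while a point inside the linear region such as `theta(0.75, 1.0, 2/3, 1.0, 3.0)` is strictly positive.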

The following theorem shows that under appropriate assumptions, two sparse approximate local
solutions $\tilde\beta^{(1)}$ and $\tilde\beta^{(2)}$ are close.



Theorem 5. Let (3 J 6e approximate local solutions with excess and A = /3 ^ 

K ±(") sparse eigenvalues in and 5^-* := supp((3^). Consider any S C {1, 

= \S\, integer m such that m + k > \S^ U S^\, and < k < K^(m + k). Then, 



{3 (2) . Let 
. , p} with 



\XA\\ 2 2 /n < 



2k_(to + k) 



^ { + l^ (2) \ 5«|0 2 (O+,k) + u} (26) 



(re_(m + fc) 
with v = {(i/ 1 )) 1 / 2 + (i/W) 1 / 2 } 2 , and 



|5\5( 2 )|< inf [# (j G 5 : < A /VM™ + fc)} + ||XA|| 2 /(A 2 

Ao>0 L l J J 



n 



(27) 



//in addition 9(0+, k) = and p(0+;A) > ||X^ c (X/3 — y)/^||oo wii/i S 1 D S^ 1 ) and |S*| > k, then 

3[{K 2 /K_(m + jfe) + Hi + (m)}\\XAg/n + V^] 



\S^\S\ < 



{p(0+; A) - \\X^(X(3 (1) - yj/nlloo} 



(28) 



~(2) 

Let $S = \mathrm{supp}(\beta)$. For comparison between a sparse local or global solution $\tilde\beta^{(2)}$ with $|S^{(2)} \setminus S| \le m$
and an oracle solution $\tilde\beta^{(1)}$ with $S^{(1)} = S$, the sparse convexity condition implies $\tilde\beta^{(1)} = \tilde\beta^{(2)}$
when $\kappa^* < \kappa_-(|S| + m)$ as in Remark 12. However, since $\kappa^* = \infty$ when $\dot\rho(t+;\lambda) < \dot\rho(t-;\lambda)$ at a
point $t > 0$, the sparse convexity argument requires the continuity of $\dot\rho(t;\lambda)$ for $t > 0$. This does not
apply to the capped-$\ell_1$ penalty. In Theorem 5, if $\theta(0+,\kappa) = \theta(\tilde\beta^{(1)}_S,\kappa) = 0$ with $\kappa < \kappa_-(|S| + m)$,

then $X\Delta = 0$, and hence $\tilde\beta^{(1)} = \tilde\beta^{(2)}$ (since $\kappa_-(|S^{(1)} \cup S^{(2)}|) > 0$). Thus, the sparse convexity
condition is much weakened to cover all left- and right-differentiable penalties such as the
capped-$\ell_1$. On the other hand, Theorem 5 does not weaken the sparse convexity condition for the MCP
$\rho(t;\lambda) = \lambda\int_0^{|t|}(1 - x/(\gamma\lambda))_+dx$, for which $\theta(0+;\kappa) = 0$ iff $\kappa \ge \kappa^* = 1/\gamma$ iff $\theta(t;\kappa) = 0$ for all
$t > 0$. It is worth pointing out that for a piecewise differentiable penalty that is not second order
differentiable, the condition $\theta(\tilde\beta^{(1)}_S,\kappa) = 0$ (thus, the uniqueness of the local solution) typically requires
$|\tilde\beta^{(1)}_j|$ to be large to avoid the discontinuities of $\dot\rho(t;\lambda)$ when $j \in S$. As pointed out in Remark 12,
this is not necessary when the penalty is second order differentiable. This means that there can
be advantages in using smooth penalty terms, which may have fewer local minimizers under certain
conditions.

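As a simple working example to illustrate Theorem 5, we consider the capped-$\ell_1$ penalty of the
form $\rho(t;\lambda) = \min(\gamma\lambda^2/2, \lambda|t|)$. Let $S = \mathrm{supp}(\beta)$. Assume that $\kappa = \kappa_-(m + |S|)/2 \ge 2/\gamma$. Then
$\theta(t,\kappa) = 0$ when either $t = 0\pm$ or $|t| \ge \gamma\lambda$. Therefore, if we define $\tilde\beta^{(1)}$ as $\tilde\beta^{(1)}_j = \hat\beta^{o}_j$ when $|\hat\beta^{o}_j| \ge \gamma\lambda$
and $\tilde\beta^{(1)}_j = 0$ otherwise, then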

||XA||l/n< 



K-(m + \S\y 



and by taking Ao = 7A^/ K_(m + \S\), we have 

~ (2) \\XA\\l ~ (2) 3[l.25K + (m)||XA|||/n + ^ 2 )] 

'-^-(m+I^Dn' ' ^ '-{A-UXj^-yVnlU} 2 ' 

We now consider selection consistency of the global solution of (2) by comparing it with an oracle
solution with Theorem 5. For this purpose, we treat the oracle LSE as an approximate local
solution by finding its excess $\nu$ in (25), and provide a sufficient condition for the existence of a sign
consistent oracle local solution. This sufficient condition is characterized by the following extension
of the quantities $\theta_1^*$ and $\theta_2^*$ in (9) from the $\ell_1$ to general penalty:

$$\theta_1 = \inf\{\theta : \|(X_S^TX_S/n)^{-1}\dot\rho(v_S + \hat\beta^{o}_S;\lambda)\|_\infty \le \theta\lambda^*,\ \forall\,\|v_S\|_\infty \le \theta\lambda^*\},$$
$$\theta_2 = \sup\{\|X_{S^c}^TX_S(X_S^TX_S)^{-1}\dot\rho(v_S + \hat\beta^{o}_S;\lambda)\|_\infty/\lambda^* : \|v_S\|_\infty \le \theta_1\lambda^*\},$$

where $S = \mathrm{supp}(\beta)$ and $\hat\beta^{o}$ is the oracle LSE in Definition 1 (d). Note that when $\dot\rho(\hat\beta^{o}_S;\lambda) = 0$,
$\theta_1 = 0$ is attained with $v_S = 0$ and consequently $\theta_2 = 0$.

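Theorem 6. (i) Let $S = \mathrm{supp}(\beta)$ and $P_S$ be the projection to the column space of $X_S$. Suppose
$\rho(t;\lambda)$ is left- and right-differentiable in $t > 0$ and $\|X_{S^c}^TP_S^{\perp}\varepsilon/n\|_\infty \le \dot\rho(0+;\lambda)$, where $P_S^{\perp} = I - P_S$. Then, the oracle LSE
$\hat\beta^{o}$ satisfies (25) with $\nu = \|\dot\rho(\hat\beta^{o}_S;\lambda)\|_2^2$. If in addition (17) holds and $\nu = 0 = \theta(\hat\beta^{o},\kappa)$ with a certain
$\kappa < \kappa_-(m + |S|)$ and $m$ in (22) or (23), then $\hat\beta^{o}$ is the global solution of (2).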
(ii) Suppose p(t; A) is uniformly continuous in t in the region Uj<zs[f3° — 9i,f3° + 9\\. Suppose 

sgn(3°) = sgn(/3), mm |^| > 9\X* , X* > ||X T P^/n|| 00 /(l - 9 2 ) + . (29) 



Then, there exists a local solution (3 of (0) satisfying sgn(/3 ) = sgn(/3) and ||/3 — (3 ||oo < ^iA' 
If in addition |i7[ ) ZioWs and 9((3 , k) = with a certain k < K_(m + l^l) and m in h22\) or l2~3\ 
Then, /3 is the global solution of (0). 






Remark 14. (i) For the capped-$\ell_1$ penalty $\rho(t;\lambda) = \min(\gamma\lambda^2/2, \lambda|t|)$, $\nu = 0 = \theta(\hat\beta^{o},\kappa)$ for $\kappa \ge 2/\gamma$
when $\min_{j \in S}|\hat\beta^{o}_j| \ge \gamma\lambda$. For the MCP $\rho(t;\lambda) = \lambda\int_0^{|t|}(1 - x/(\gamma\lambda))_+dx$, $\theta(\cdot,\kappa) = 0$ for $\kappa \ge 1/\gamma$.
For the SCAD penalty $\rho(t;\lambda) = \lambda\int_0^{|t|}\min\{1, (1 - (x/\lambda - 1)/(\gamma - 1))_+\}dx$, $\theta(\cdot,\kappa) = 0$ for $\kappa \ge 1/(\gamma - 1)$.
(ii) For the $\ell_1$ penalty, $\dot\rho(b;\lambda) = \lambda\,\mathrm{sgn}(b)$, so that (29) is identical to (9) for the Lasso
selection consistency. For concave penalties, $|\dot\rho(t;\lambda)|$ is small for large $|t|$, so that $\{\theta_1,\theta_2\}$ are
typically smaller than $\{\theta_1^*,\theta_2^*\}$ for strong signals. In such cases, (29) is much weaker than (9).

For a nonconvex penalty such that $\dot\rho(t;\lambda) = 0$ when $|t| > a_0\lambda$ for some constant $a_0 > 0$, we
automatically have $\dot\rho(\hat\beta^{o}_S;\lambda) = 0$ when $\min_{j \in S}|\hat\beta^{o}_j| > a_0\lambda$, which implies that $\theta_1 = \theta_2 = 0$. This
special case gives the following easier-to-interpret corollary as a direct consequence of Theorems 5
and 6.

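Corollary 3. Let $S = \mathrm{supp}(\beta)$ and $P_S$ be the projection to the column space of $X_S$. Suppose
$\rho(t;\lambda)$ is left- and right-differentiable in $t > 0$ and $\|X_{S^c}^TP_S^{\perp}\varepsilon/n\|_\infty \le \dot\rho(0+;\lambda)$, where $P_S^{\perp} = I - P_S$. If (17) holds and
$\dot\rho(\hat\beta^{o}_S;\lambda) = 0$, and $\theta(\hat\beta^{o},\kappa) = 0$ with a certain $\kappa < \kappa_-(m + |S|)$ and $m$ in (22) or (23), then $\hat\beta^{o}$ is
the global solution of (2). Moreover, for any other exact local solution $\tilde\beta$ of (2) that is sparse with
$|\mathrm{supp}(\tilde\beta) \setminus S| \le m$, we have $\tilde\beta = \hat\beta^{o}$.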

Consider the simple examples of the capped-$\ell_1$ penalty and MCP. For the capped-$\ell_1$ penalty
$\rho(t;\lambda) = \min(\gamma\lambda^2/2, \lambda|t|)$, we pick a sufficiently large $\gamma$ such that $\gamma > 2/\kappa_-(|S| + m)$ for the $m$ in
(22) or (23). This will be possible with $m \asymp |S|$ when $\kappa_-(m)$ is uniformly bounded away from zero
for small $m(\ln p)/n$ and $|S|(\ln p)/n$ is even smaller. For the MCP $\rho(t;\lambda) = \lambda\int_0^{|t|}(1 - x/(\lambda\gamma))_+dx$,
we pick $\gamma > 1/\kappa_-(|S| + m)$ for the $m$ in (22) or (23). If $\min_{j \in S}|\hat\beta^{o}_j| \ge \gamma\lambda$, then the conditions
of Corollary 3 are automatically satisfied for both penalties when $\|X_{S^c}^TP_S^{\perp}\varepsilon/n\|_\infty < \lambda$ (which can
always be satisfied with a sufficiently large choice of $\lambda$). It follows that in this case, $\hat\beta^{o}$ is the
global solution of (2), and there is no other local solution with no more than $m$ nonzero elements
outside $S$. The essential condition here is the null consistency (17), which is an $\ell_2$ condition.
Note that in view of Corollary 2, the RIF condition is not essential for the equality of the global
and oracle solutions in these examples, both with finite $\gamma^* = \gamma/2$. A similar result holds for the
SCAD penalty, with somewhat different constant factors. The requirement of $\min_{j \in S}|\hat\beta^{o}_j| \ge \gamma\lambda$
is natural, and it directly follows (with probability $1 - \delta$) from the condition $\min_{j \in S}|\beta_j| \ge
\gamma\lambda + \sigma\big(1 + \sqrt{2\ln(|S|/\delta)}\big)\lambda_{\min}^{-1/2}(X_S^TX_S)$ under Assumption 1.

4.5 Approximate Global Solutions 

We have mentioned in Remark 13 that a gradient algorithm started from the Lasso may yield the global
solution of (2) for general $\rho(t;\lambda)$ under a sparse convexity condition or its generalization. Here we
provide sufficient conditions for this to happen. This is done via a notion of approximate global
solution. Given $\nu \ge 0$ and $b \in \mathbb{R}^p$, we say that a vector $\tilde\beta \in \mathbb{R}^p$ is a $\{\nu, b\}$ approximate global
solution of (2) if

$$\Big[\frac{1}{2n}\|X\tilde\beta - y\|_2^2 + \|\rho(\tilde\beta;\lambda)\|_1\Big] - \Big[\frac{1}{2n}\|Xb - y\|_2^2 + \|\rho(b;\lambda)\|_1\Big] \le \nu. \tag{30}$$

To align different penalties at the same threshold level, we assume throughout this subsection that
$\lambda^*$ depends on $\rho(t;\lambda)$ only through $\lambda$ in (12), e.g. $\lambda^* = \lambda$.

One method to find a sparse local solution is to find a local solution that is also an approximate
global solution. This can be achieved with the following simple procedure. First, we find the Lasso
solution $\hat\beta^{(\ell_1)}$ of (7). The following theorem shows that it is a $\{\nu, \beta\}$ approximate global solution of
(2) with a relatively small $\nu$ under proper conditions. Now we can start with this solution $\hat\beta^{(\ell_1)}$ and
use gradient descent to find a local solution $\tilde\beta$ of (2) that is also an approximate global solution.
The following theorem then shows that under appropriate conditions, this local solution is sparse.
Therefore results from Subsections 4.2 and 4.4 can be applied to relate it to the true global solution
of (2).
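
The two-stage procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure analyzed in the paper: it uses proximal gradient steps with a fixed step size $1/L$ (where $L$ is the largest eigenvalue of $X^TX/n$) in place of gradient descent with line search, and it uses the MCP as the concave penalty. With this step size each proximal step does not increase the objective, so every iterate remains an approximate global solution at least as good as the Lasso start. All function names and default iteration counts are hypothetical.

```python
import numpy as np

def soft_threshold(z, thr):
    # Proximal map of thr * |.| (Lasso step)
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def mcp(b, lam, gamma):
    # MCP penalty in closed form
    a = np.abs(b)
    return np.where(a <= gamma * lam, lam * a - a**2 / (2.0 * gamma), 0.5 * gamma * lam**2)

def mcp_prox(z, lam, gamma, tau):
    # Exact minimizer of (b - z)^2 / (2 tau) + mcp(b); requires tau < gamma
    a = np.abs(z)
    mid = np.sign(z) * (a - tau * lam) / (1.0 - tau / gamma)
    return np.where(a <= tau * lam, 0.0, np.where(a <= gamma * lam, mid, z))

def objective(X, y, b, lam, gamma):
    # Penalized loss ||y - Xb||_2^2 / (2n) + sum_j mcp(b_j)
    n = X.shape[0]
    return 0.5 * np.sum((y - X @ b) ** 2) / n + float(np.sum(mcp(b, lam, gamma)))

def lasso_then_mcp(X, y, lam, gamma=3.0, n_lasso=500, n_mcp=500):
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n)[-1]  # Lipschitz constant of the gradient
    tau = 1.0 / L
    if tau >= gamma:
        raise ValueError("step size must be smaller than gamma for the MCP prox")
    b = np.zeros(p)
    for _ in range(n_lasso):  # stage 1: approximate Lasso solution
        b = soft_threshold(b - tau * (X.T @ (X @ b - y) / n), tau * lam)
    b_lasso = b.copy()
    history = [objective(X, y, b, lam, gamma)]
    for _ in range(n_mcp):  # stage 2: descend the concave-penalized loss
        b = mcp_prox(b - tau * (X.T @ (X @ b - y) / n), lam, gamma, tau)
        history.append(objective(X, y, b, lam, gamma))
    return b_lasso, b, np.array(history)
```

The returned `history` records the MCP-penalized objective along the second stage, starting from its value at the Lasso solution; it is non-increasing, which mirrors the role of the line search assumed in the discussion of Theorem 7.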

Theorem 7. Consider a penalty function $\rho(t;\lambda)$ with $\lambda = \lambda^*$ in (12). Suppose the $\eta$ null-consistency
condition (17) holds for $\rho(t;\lambda)$ with $0 < \eta < 1$.

(i) Suppose $\tilde m = O(|S|)$ in (24) or under the SRC in Remark 11 for the Lasso $\hat\beta^{(\ell_1)}$ in (7). Then,
the Lasso $\hat\beta^{(\ell_1)}$ is a $\{\nu, \beta\}$ approximate global solution for the penalty $\rho(t;\lambda)$ with $\nu \lesssim \lambda^2|S|$.

(ii) Assume that $\rho(t;\lambda)$ is continuous at $t = 0$. Let $\tilde\beta$ be a local solution of (2) that is also a
$\{\nu, \beta\}$ approximate global solution. Let $\xi' = 2/(1-\eta)$. Consider $t_0 \ge 0$ and integer $m_0 \ge 0$ such
that $\{2\kappa_+(m_0)b/m_0\}^{1/2} + \|X^T\varepsilon/n\|_\infty < \inf_{0<s<t_0}\dot\rho(s;\lambda)$, where $b = \xi'\max\{\nu, \Delta(a_1'\lambda_*, |S|;\lambda)\}$ with
$a_1' := (1+\eta)/\mathrm{RIF}_1(\xi',S)$ and $\lambda_* := \sup_{t>0}|\dot\rho(t;\lambda)|$. Then,

$$\#\{j \notin S : \tilde\beta_j \neq 0\} < \tilde m := m_0 + \lfloor b/\rho(t_0;\lambda)\rfloor.$$

Remark 15. If $\rho(t;\lambda)$ is concave in $t$, then $\lambda_* = \dot\rho(0+;\lambda)$ and $\inf_{0<s<t_0}\dot\rho(s;\lambda)$ can be replaced by
$\dot\rho(t_0;\lambda)$ for choosing $(t_0, m_0)$. Theorem 7 applies to the $\ell_1$, capped-$\ell_1$, MCP and SCAD penalties
with $\lambda = \lambda^* \asymp \lambda_*$, but not to the bridge penalty for which $\lambda_* = \infty$.

Theorem 7 shows that the $\ell_1$ solution $\hat\beta^{(\ell_1)}$ is $\{\nu, \beta\}$ approximately global optimal with $\nu =
O(|S|(\lambda^*)^2)$ in (30), and that a local solution $\tilde\beta$ which is also approximately global optimal is a sparse
local solution. Thus, with $b = O((\lambda^*)^2|S|)$ and $\rho(t_0;\lambda) \asymp (\lambda^*)^2 \asymp (\lambda_*)^2$, the local solution $\tilde\beta$
obtained with gradient descent from $\hat\beta^{(\ell_1)}$ is sparse with $\#\{j \notin S : \tilde\beta_j \neq 0\} = O(|S|)$. Here we
assume that a line search is performed in the gradient descent procedure so that the objective
function always decreases (and thus each step leads to a $\{\nu, \beta\}$ approximate global optimal solution).
Now Remark 12 can be applied to this sparse local solution, providing suitable conditions for this
solution to be identical to the global optimal solution. If $\min_{j \in S}|\beta_j| \ge C\lambda_{univ}$ for a sufficiently
large $C$, Corollary 3 (or Theorem 5 plus Theorem 6) can be applied to identify this local solution
as the oracle LSE (or penalized LSE) and the global solution.

It is worth pointing out that the results of this paper concerning the global solution can be applied under
the null-consistency condition. For a general penalty function, this requires the condition (18) to
hold. Although this is an $\ell_2$ condition, it is not needed for either the $\ell_1$ or the $\ell_0$ penalty, as pointed out
in Remark 9. In fact, this condition is also not needed if we consider local solutions obtained with
more specific numerical procedures such as [42], [47] that lead to specific sparse local solutions with
oracle properties. Nevertheless, it is useful to observe that if the extra condition (18) holds, then
such a local solution is also the unique global solution, and it can be obtained via other numerical
procedures.

5 Technical Proofs 

We first prove the following two lemmas, which will be useful in the analysis. 
Lemma 1. If $\hat\beta$ is the global solution of (2), then $\|X^T(y - X\hat\beta)/n\|_\infty \le \lambda^*$. In particular, $\|X^T\varepsilon/n\|_\infty \le \eta\lambda^*$ under the $\eta$ null consistency condition (17).

Proof. The optimality of $\hat\beta$ implies

$$\|y - X\hat\beta\|_2^2/(2n) + \rho(\hat\beta_j;\lambda) \le \|y - X\hat\beta - x_jt\|_2^2/(2n) + \rho(\hat\beta_j + t;\lambda)$$

for all real $t$. Since $\rho(t;\lambda)$ is subadditive in $t$,

$$t\,x_j^T(y - X\hat\beta)/n \le t^2\|x_j\|_2^2/(2n) + \rho(\hat\beta_j + t;\lambda) - \rho(\hat\beta_j;\lambda) \le t^2/2 + \rho(t;\lambda).$$

Since $t$ is arbitrary, we obtain the desired bound via the definition of $\lambda^*$ in (12). □
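
To make the threshold $\lambda^*$ in (12) concrete, the following Python sketch approximates $\inf_{t>0}\{t/2 + \rho(t;\lambda)/t\}$ on a grid for three penalties. It is only an illustration: the function name `lambda_star`, the grid, and the MCP parameter `gamma = 2` are assumptions made here, and a grid minimum can only overestimate the infimum slightly.

```python
import numpy as np

def lambda_star(rho, t_grid):
    """Grid approximation of inf over t > 0 of t/2 + rho(t)/t."""
    return np.min(t_grid / 2 + rho(t_grid) / t_grid)

lam, gamma = 1.0, 2.0
t = np.linspace(1e-4, 10.0, 200001)

# Three example penalties rho(t; lam), evaluated elementwise.
l1 = lambda s: lam * np.abs(s)
capped_l1 = lambda s: np.minimum(lam**2 / 2, lam * np.abs(s))
mcp = lambda s: np.where(np.abs(s) <= gamma * lam,
                         lam * np.abs(s) - s**2 / (2 * gamma),
                         gamma * lam**2 / 2)

for name, rho in [("l1", l1), ("capped-l1", capped_l1), ("MCP", mcp)]:
    print(name, lambda_star(rho, t))
```

For these three choices the computed value is close to `lam`, in agreement with $\lambda^* = \lambda$.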

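Lemma 2. Assume the null consistency condition (17) with $\eta \in (0,1)$. Suppose $\tilde\beta \in \mathbb{R}^p$ satisfies

$$\|y - X\tilde\beta\|_2^2/(2n) + \|\rho(\tilde\beta;\lambda)\|_1 \le \|y - X\beta\|_2^2/(2n) + \|\rho(\beta;\lambda)\|_1 + \nu$$

with a certain $\nu > 0$. Let $\Delta = \tilde\beta - \beta$, $\xi = (1+\eta)/(1-\eta)$, and $S = \mathrm{supp}(\beta)$. Then,

$$\|X\Delta\|_2^2/(2n) + \|\rho(\Delta_{S^c};\lambda)\|_1 \le \xi\|\rho(\Delta_S;\lambda)\|_1 + \nu/(1-\eta).$$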

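Proof. From the condition of the lemma, we have

$$\begin{aligned}
0 &\le \nu + \|y - X\beta\|_2^2/(2n) + \|\rho(\beta;\lambda)\|_1 - \|y - X\tilde\beta\|_2^2/(2n) - \|\rho(\tilde\beta;\lambda)\|_1 \\
&= \nu - \|X\Delta\|_2^2/(2n) + \varepsilon^TX\Delta/n + \|\rho(\beta;\lambda)\|_1 - \|\rho(\beta + \Delta;\lambda)\|_1.
\end{aligned}$$

By (17), $\|\varepsilon/\eta\|_2^2/(2n) \le \|\varepsilon/\eta - tX\Delta\|_2^2/(2n) + \|\rho(t\Delta;\lambda)\|_1$ for all $t > 0$, which can be written as

$$\varepsilon^TX\Delta/n \le \eta t\|X\Delta\|_2^2/(2n) + (\eta/t)\|\rho(t\Delta;\lambda)\|_1.$$

The above two displayed inequalities yield

$$(1 - \eta t)\|X\Delta\|_2^2/(2n) - \nu \le (\eta/t)\|\rho(t\Delta;\lambda)\|_1 + \|\rho(\beta;\lambda)\|_1 - \|\rho(\beta + \Delta;\lambda)\|_1. \tag{31}$$

Now let $t = 1$. It follows from (31), $\beta_{S^c} = 0$, and then the sub-additivity of $\rho(t;\lambda)$ that

$$\begin{aligned}
(1-\eta)\|X\Delta\|_2^2/(2n) - \nu &\le \eta\|\rho(\Delta;\lambda)\|_1 + \|\rho(\beta_S;\lambda)\|_1 - \|\rho(\beta_S + \Delta_S;\lambda)\|_1 - \|\rho(\Delta_{S^c};\lambda)\|_1 \\
&\le (\eta + 1)\|\rho(\Delta_S;\lambda)\|_1 + (\eta - 1)\|\rho(\Delta_{S^c};\lambda)\|_1. \quad □
\end{aligned}$$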
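
For completeness, the passage from the last display of the proof to the statement of Lemma 2 is the following rearrangement, which moves the $\|\rho(\Delta_{S^c};\lambda)\|_1$ term to the left-hand side and divides by $1 - \eta > 0$:

```latex
\begin{align*}
(1-\eta)\Big\{\|X\Delta\|_2^2/(2n) + \|\rho(\Delta_{S^c};\lambda)\|_1\Big\}
  &\le (1+\eta)\,\|\rho(\Delta_S;\lambda)\|_1 + \nu, \\
\|X\Delta\|_2^2/(2n) + \|\rho(\Delta_{S^c};\lambda)\|_1
  &\le \xi\,\|\rho(\Delta_S;\lambda)\|_1 + \nu/(1-\eta),
  \qquad \xi = (1+\eta)/(1-\eta).
\end{align*}
```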

5.1 Proof of Proposition 1

Let $t > 0$. By (12), $\rho(t;\lambda) \ge t(\lambda^* - t/2) \ge t\lambda^*/2$ for $t \le \lambda^*$. For $t > \lambda^*$, $\rho(t;\lambda) \ge \rho(\lambda^*;\lambda) \ge (\lambda^*)^2/2$. This gives the lower bound of $\rho(t;\lambda)$. Let $t_0$ be the minimizer in (12) in the sense of $x/2 + \rho(x;\lambda)/x \to \lambda^*$ as $x \to t_0$ (when $t_0$ is a discontinuity of $\rho(\cdot;\lambda)$) or $x = t_0$. Let $x > 0$ and $q = \lfloor t/x \rfloor$. Since $\rho(t;\lambda)$ is nondecreasing and subadditive in $t > 0$, we have

$$\rho(t;\lambda) \le \rho(qx;\lambda) + \rho(t - qx;\lambda) \le (q+1)\rho(x;\lambda) \le (t + x)\rho(x;\lambda)/x.$$

It follows that (let $x \to t_0$) $\rho(t;\lambda) \le (t + t_0)(\lambda^* - t_0/2) \le \max_{t' \ge 0}(t + t')(\lambda^* - t'/2) = \rho^*(t;\lambda)$. The bound for $\Delta(a,k;\lambda)$ follows similarly from (let $x \to t_0$)

$$\|\rho(b;\lambda)\|_1 \le (\|b\|_1 + kx)\rho(x;\lambda)/x \le k(a + x)\rho(x;\lambda)/x \to k(a + t_0)(\lambda^* - t_0/2) \le k\rho^*(a;\lambda).$$

The fact that $\rho^*(a;\lambda) \le \max(a, 2\lambda^*)\lambda^*$ can be verified by simple algebra. □
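
The "simple algebra" in the last step can be spelled out as follows, using only the definition $\rho^*(a;\lambda) = \max_{t' \ge 0}(a + t')(\lambda^* - t'/2)$ from the proof above. The function $g$ is notation introduced here for the worked step.

```latex
\begin{align*}
g(t') &:= (a + t')(\lambda^* - t'/2), \qquad g'(t') = \lambda^* - a/2 - t'. \\
\text{If } a \ge 2\lambda^*: &\quad g \text{ is nonincreasing on } [0,\infty),
  \text{ so } \rho^*(a;\lambda) = g(0) = a\lambda^*. \\
\text{If } a < 2\lambda^*: &\quad \text{the maximizer is } t' = \lambda^* - a/2 > 0, \text{ and} \\
  &\quad \rho^*(a;\lambda) = (\lambda^* + a/2)^2/2 \le (2\lambda^*)^2/2 = 2(\lambda^*)^2. \\
\text{In both cases: } &\quad \rho^*(a;\lambda) \le \max(a, 2\lambda^*)\,\lambda^*.
\end{align*}
```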
5.2 Proof of Proposition 2

Let $f(t) = t/\rho(t;\lambda)$ and $A$ be the index set of the $|S|$ largest $|u_j|$. Since $\rho(t;\lambda)$ is nondecreasing in $|t|$, $\|\rho(u_{S^c};\lambda)\|_1 \le \xi\|\rho(u_S;\lambda)\|_1$ implies $\|\rho(u_{A^c};\lambda)\|_1 \le \xi\|\rho(u_A;\lambda)\|_1$. Since $f(t)$ is nondecreasing in $t$,

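$$\|u_{A^c}\|_1 \le f(\|u_{A^c}\|_\infty)\,\|\rho(u_{A^c};\lambda)\|_1 \le \xi f(\|u_{A^c}\|_\infty)\,\|\rho(u_A;\lambda)\|_1 \le \xi\|u_A\|_1.$$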

This implies (15). In the above derivation, the first inequality follows from the definition of $f(t)$ and $\|u_{A^c}\|_\infty \ge |u_j|$ for all $j \in A^c$; the second inequality is due to the condition $\|\rho(u_{A^c};\lambda)\|_1 \le \xi\|\rho(u_A;\lambda)\|_1$; the third inequality follows from the definition of $f(t)$ and the condition $\|u_{A^c}\|_\infty \le |u_j|$ for all $j \in A$. □

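5.3 Proof of Proposition 3

Since the left-hand side of (17) is increasing in $\rho(t;\lambda)$, we assume without loss of generality that

$$\rho(t;\lambda) = \min(\lambda^2/2, \lambda|t|), \quad \lambda = (1 + C_0)(\sigma/\eta)\lambda_0, \quad \lambda_0 = (1 + \sqrt{2\ln(2p/\delta)})/\sqrt{n}.$$

Since $\|X^T\varepsilon/n\|_\infty \le \max_{|A|=1}\|P_A\varepsilon\|_2/\sqrt{n}$ and $\|P_A\varepsilon\|_2 \le \|\varepsilon\|_2$, Assumption 1 implies that

$$\|X^T\varepsilon/n\|_\infty \le \sigma\lambda_0, \quad \|P_A\varepsilon\|_2 \le \sigma\min\{\sqrt{n|A|}\,\lambda_0, \sqrt{2n}\}, \quad \forall A \subseteq \{1,\ldots,p\}, \tag{32}$$

with at least probability $1 - \exp(-n(1 - 1/\sqrt{2})^2) - \sum_{k=1}^{p}\binom{p}{k}(\delta/(2p))^k \ge 2 - e^{-n(1 - 1/\sqrt{2})^2} - e^{\delta/2}$.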

Let $A = \{j : |b_j| > \lambda/2\}$ and $k = |A|$. It suffices to consider the case where $A$ and $b$ satisfy

$$X_Ab_A = P_A(\varepsilon/\eta - X_{A^c}b_{A^c}), \quad \mathrm{rank}(P_A) = |A| = k \le \frac{\|\varepsilon/\eta\|_2^2/(2n)}{\lambda^2/2} \le \frac{2}{(1 + C_0)^2\lambda_0^2},$$

since these conditions hold for the global minimum for (2) with $y = \varepsilon/\eta$ and the capped-$\ell_1$ penalty. Under these conditions, we have $Xb = P_A\varepsilon/\eta + P_A^\perp X_{A^c}b_{A^c}$ and



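$$\begin{aligned}
&\|\varepsilon/\eta\|_2^2/(2n) - \|\varepsilon/\eta - Xb\|_2^2/(2n) - \|\rho(b;\lambda)\|_1 \\
&\quad = (Xb)^T(\varepsilon/\eta)/n - \|Xb\|_2^2/(2n) - \|\rho(b;\lambda)\|_1 \\
&\quad = \|P_A(\varepsilon/\eta)\|_2^2/(2n) + (P_A^\perp X_{A^c}b_{A^c})^T(\varepsilon/\eta)/n - \|P_A^\perp X_{A^c}b_{A^c}\|_2^2/(2n) - \|\rho(b;\lambda)\|_1 \\
&\quad \le \|P_A(\varepsilon/\eta)\|_2^2/(2n) + (X_{A^c}b_{A^c})^T(\varepsilon/\eta)/n - (X_{A^c}b_{A^c})^TP_A(\varepsilon/\eta)/n - \|\rho(b;\lambda)\|_1 \\
&\quad \le \lambda_0^2k(\sigma/\eta)^2/2 + \lambda_0(\sigma/\eta)\|b_{A^c}\|_1 - (X_{A^c}b_{A^c})^TP_A(\varepsilon/\eta)/n - \|\rho(b;\lambda)\|_1 \\
&\quad \le \Big(\frac{1}{1 + C_0} - 1\Big)\|\rho(b;\lambda)\|_1 - (X_{A^c}b_{A^c})^TP_A(\varepsilon/\eta)/n. 
\end{aligned} \tag{33}$$

In the above derivation, the second inequality uses (32) and the third uses the fact that $\|\rho(b;\lambda)\|_1 = \lambda^2k/2 + \lambda\|b_{A^c}\|_1$ by the definition of $A$ and $\lambda = (1 + C_0)(\sigma/\eta)\lambda_0$.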
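
The identity $\|\rho(b;\lambda)\|_1 = \lambda^2k/2 + \lambda\|b_{A^c}\|_1$ for the capped-$\ell_1$ penalty, with $A = \{j : |b_j| > \lambda/2\}$ and $k = |A|$, can be checked numerically. The short Python sketch below does so for one arbitrary vector; the function name `capped_l1` and the test vector are illustrative assumptions.

```python
import numpy as np

def capped_l1(b, lam):
    """Capped-l1 penalty min(lam^2/2, lam*|t|), applied elementwise."""
    return np.minimum(lam**2 / 2, lam * np.abs(b))

lam = 1.0
b = np.array([2.0, -0.7, 0.3, 0.0, -0.1])

A = np.abs(b) > lam / 2                 # indices where the penalty is capped at lam^2/2
k = int(A.sum())

lhs = capped_l1(b, lam).sum()           # ||rho(b; lam)||_1
rhs = lam**2 * k / 2 + lam * np.abs(b[~A]).sum()
print(k, lhs, rhs)
```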
It follows from the shifting inequality in 01] that 



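$$\begin{aligned}
|\varepsilon^TP_AX_{A^c}b_{A^c}| &\le \max_{B\cap A=\emptyset,\,|B|=k}\|X_B^TP_A\varepsilon\|_2\big(\|b_{A^c}\|_\infty k^{1/2} + \|b_{A^c}\|_1/k^{1/2}\big) \\
&\le \max_{B\cap A=\emptyset,\,|B|=k}\|X_B^TP_A\varepsilon\|_2\,(\lambda k/2 + \|b_{A^c}\|_1)/\sqrt{k}.
\end{aligned}$$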
In the above derivation, the first inequality uses the shifting inequality and the second uses the fact that $\|b_{A^c}\|_\infty \le \lambda/2$ due to the definition of $A$. It follows from (18) and $\|P_A\varepsilon\|_2 \le \sigma\lambda_0\sqrt{nk}$ of (32) that for all $|A| = |B| = k \le 2/\{(1 + C_0)^2\lambda_0^2\}$ with $B \cap A = \emptyset$,

$$\|X_B^TP_A\varepsilon\|_2 \le \lambda_{\max}^{1/2}(X_B^TP_AX_B)\,\|P_A\varepsilon\|_2 \le (C_0\sqrt{n})(\sigma\lambda_0\sqrt{nk}) = \sigma C_0\lambda_0 n\sqrt{k}.$$

Thus, by combining the above two displayed inequalities, we find

$$\frac{|\varepsilon^TP_AX_{A^c}b_{A^c}|}{\eta n} \le \frac{\sigma C_0\lambda_0 n\sqrt{k}}{\eta n}\Big(\frac{\lambda k/2 + \|b_{A^c}\|_1}{\sqrt{k}}\Big) = \frac{C_0}{1 + C_0}\|\rho(b;\lambda)\|_1$$

due to $\lambda = (1 + C_0)(\sigma/\eta)\lambda_0$ and $\|\rho(b;\lambda)\|_1 = \lambda^2k/2 + \lambda\|b_{A^c}\|_1$. This and (33) yield the null consistency condition (17).

It remains to prove that (18) is an $\ell_2$-regularity condition on $X$. Suppose that the rows of $X$ are iid from $N(0, \Sigma)$. Let $N_{k,m}$ denote a $k \times m$ matrix with iid $N(0,1)$ entries. We may write $X_B = N_{n,p}(\Sigma^{1/2})_{p\times B}$. Let $UU^T$ and $VDW^T$ be the SVDs of $P_A$ and $(\Sigma^{1/2})_{p\times B}$, respectively. For fixed $\{A, B\}$, the entries of the $k \times k$ matrix $U^TN_{n,p}V$ are uncorrelated $N(0,1)$ variables, so that we can write $P_AX_B = UN_{k,k}V^T(\Sigma^{1/2})_{p\times B}$. Thus, by Theorem 11.13 of p2],

$$P\big\{\lambda_{\max}^{1/2}(X_B^TP_AX_B) \ge (2k^{1/2} + t)\lambda_{\max}^{1/2}(\Sigma)\big\} \le P\big\{\lambda_{\max}^{1/2}(N_{k,k}^TN_{k,k}) \ge 2k^{1/2} + t\big\} \le \Phi(-t) \le e^{-t^2/2}/2, \quad t > 0,$$

where $\Phi$ is the $N(0,1)$ distribution function. Since there are no more than $\binom{p}{k,\,k,\,p-2k}$ choices of $P_A$ with rank $k$ and $|B| = k$ with $A \cap B = \emptyset$,

^^max^^At^lB) < X]H^)(2k l l 2 + v^ln^/J)) , VKKn, (34) 
with probability no smaller than 

i l \-( 5 Y k ( p U, i A (^/(4p)) 2fc A4/nfi2 , 

In the event (34), we have that for all $|A| = |B| = k$ and $k(1 + C_0)^2(1 + \sqrt{2\ln(2p/\delta)})^2 \le 2n$,

$$\lambda_{\max}^{1/2}(X_B^TP_AX_B/n) \le \frac{\lambda_{\max}^{1/2}(\Sigma)\big(2k^{1/2} + \sqrt{8k\ln(2p/\delta)}\big)}{\{k^{1/2}(1 + C_0)(1 + \sqrt{2\ln(2p/\delta)})\}/\sqrt{2}} = \frac{\sqrt{8}\,\lambda_{\max}^{1/2}(\Sigma)}{1 + C_0}.$$

This proves the desired result. □
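
The two ingredients of the argument above can be illustrated numerically. The Python sketch below computes $\lambda_{\max}^{1/2}(X_B^TP_AX_B)$ for one pair of disjoint index sets under a standard Gaussian design, and checks the algebraic simplification to $\sqrt{8}/(1 + C_0)$. It is a sketch under assumptions made here: $\Sigma = I$, the dimensions and the sets `A`, `B` are arbitrary, and the helper names `sqrt_lambda_max_projected` and `ratio` are hypothetical.

```python
import numpy as np

def sqrt_lambda_max_projected(X, A, B):
    """Square root of the largest eigenvalue of X_B^T P_A X_B,
    where P_A is the orthogonal projection onto the column space of X_A."""
    Q, _ = np.linalg.qr(X[:, A])        # orthonormal basis of span(X_A), so P_A = Q Q^T
    M = Q.T @ X[:, B]                   # X_B^T P_A X_B = M^T M
    return np.linalg.svd(M, compute_uv=False)[0]

def ratio(k, p, delta, C0):
    """(2 sqrt(k) + sqrt(8 k ln(2p/delta))) / (sqrt(k) (1+C0) (1 + sqrt(2 ln(2p/delta))) / sqrt(2))."""
    L = np.sqrt(2 * np.log(2 * p / delta))
    num = 2 * np.sqrt(k) + np.sqrt(8 * k * np.log(2 * p / delta))
    den = np.sqrt(k) * (1 + C0) * (1 + L) / np.sqrt(2)
    return num / den

rng = np.random.default_rng(0)
n, p, k = 200, 50, 5
X = rng.standard_normal((n, p))         # rows iid N(0, I_p): lambda_max(Sigma) = 1
A, B = np.arange(0, k), np.arange(k, 2 * k)

value = sqrt_lambda_max_projected(X, A, B)
print(value, 2 * np.sqrt(k))            # observed value next to the 2 sqrt(k) benchmark
print(ratio(7, 100, 0.1, 0.5), np.sqrt(8) / 1.5)
```

The second printed pair shows that the ratio does not depend on $k$, $p$ or $\delta$, which is why the final bound involves only $\lambda_{\max}(\Sigma)$ and $C_0$.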



5.4 Proof of Theorem 1

Let $\Delta = \hat\beta - \beta$. Lemma 2 (with $\nu = 0$) implies that

$$\|X\Delta\|_2^2/(2n) + \|\rho(\Delta_{S^c};\lambda)\|_1 \le \xi\|\rho(\Delta_S;\lambda)\|_1. \tag{35}$$

Thus, the definition of $\mathrm{RIF}_q(\xi, S)$ gives

$$\|\Delta\|_q \le \|X^TX\Delta\|_\infty|S|^{1/q}/\{n\,\mathrm{RIF}_q(\xi, S)\}. \tag{36}$$

It follows from Lemma 1 that $\|X^T(y - X\hat\beta)/n\|_\infty \le \lambda^*$ and $\|X^T\varepsilon/n\|_\infty \le \eta\lambda^*$. Thus, we have $\|X^TX\Delta/n\|_\infty = \|X^T(y - X\hat\beta - \varepsilon)/n\|_\infty \le (1 + \eta)\lambda^*$. This and (36) yield (19).

Now by combining the definition of $\Delta(a, |S|;\lambda)$ and $\|\Delta\|_1 \le (1 + \eta)\lambda^*|S|/\mathrm{RIF}_1(\xi, S)$, which follows from (19), we obtain an estimate of $\|\rho(\Delta_S;\lambda)\|_1$ in (35), which leads to the first inequality in (20). The second inequality in (20) then follows from Proposition 1 and Remark 4. □
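5.5 Proof of Theorem 2

Let $S_1 = \{j \in \hat S \setminus S : |\hat\beta_j| \ge t_0\}$ and $S_2 = \{j \in \hat S \setminus S : |\hat\beta_j| < t_0\}$. As in the proof of Theorem 1, it follows from the error bound (19) and the definition of $\Delta(a, k;\lambda)$ in (13) that $\|\rho(\Delta_S;\lambda)\|_1 \le \Delta(a_1\lambda^*, |S|;\lambda)$ with the given $a_1$. Thus,

$$|S_1| \le \|\rho(\Delta_{S^c};\lambda)\|_1/\rho(t_0;\lambda) \le \xi\|\rho(\Delta_S;\lambda)\|_1/\rho(t_0;\lambda) \le \xi\Delta(a_1\lambda^*, |S|;\lambda)/\rho(t_0;\lambda). \tag{37}$$

Let $\lambda_2 \ge \sqrt{2\xi\kappa_+(m_0)\Delta(a_1\lambda^*, |S|;\lambda)/m_0}$ satisfying $\lambda_2 + \|X^T\varepsilon/n\|_\infty < \inf_{0<s<t_0}\dot\rho(s;\lambda)$. The first order optimality condition implies that for all $j \in \hat S$, $x_j^T(y - X\hat\beta)/n = \dot\rho(t;\lambda)|_{t=\hat\beta_j}$. For $j \in S_2$, $|\hat\beta_j| \in (0, t_0)$, so that $|x_j^T(y - X\hat\beta)/n| > (\lambda_2 + \|X^T\varepsilon/n\|_\infty)$ by the choice of $\lambda_2$. Thus, for any set $A \subset S_2$ with $|A| \le m_0$,

$$(\lambda_2 + \|X^T\varepsilon/n\|_\infty)|A| < \|X_A^T(y - X\hat\beta)/n\|_1 \le \|X_A^T\varepsilon/n\|_\infty|A| + |A|^{1/2}\|X_A/\sqrt{n}\|_2\|X\Delta\|_2/\sqrt{n}.$$

Since $\|X_A^TX_A/n\|_2 \le \kappa_+(m_0)$, $\lambda_2|A| < |A|^{1/2}\sqrt{\kappa_+(m_0)}\,\|X\Delta\|_2/\sqrt{n}$. It follows from Theorem 1 that $|A| < \kappa_+(m_0)\|X\Delta\|_2^2/(n\lambda_2^2) \le 2\xi\kappa_+(m_0)\Delta(a_1\lambda^*, |S|;\lambda)/\lambda_2^2 \le m_0$. Thus, $\max_{A\subset S_2,\,|A|\le m_0}|A| < m_0$, which implies that $|S_2| < m_0$. Combining this estimate with (37), we obtain the desired bound. □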

5.6 Proof of Theorem 3

It follows from the assumption of the theorem that for all $b \in \mathbb{R}^p$,

$$\|Xb - \varepsilon/\eta\|_2^2 + \lambda^2n\|b\|_0 - \|\varepsilon/\eta\|_2^2 = \|Xb\|_2^2 - (2/\eta)\varepsilon^TXb + \lambda^2n\|b\|_0$$

is bounded from below by $\|Xb\|_2^2 - 2\lambda\sqrt{n\|b\|_0}\,\|Xb\|_2 + \lambda^2n\|b\|_0 = (\|Xb\|_2 - \lambda\sqrt{n\|b\|_0})^2 \ge 0$. This implies the null consistency condition. Moreover, (31) with $t = 1/\eta$ and $\nu = 0$ implies that

$$\|\hat\beta^{(\ell_0)}\|_0 - \|\beta\|_0 \le \eta^2\|\hat\beta^{(\ell_0)} - \beta\|_0 \le \eta^2\|\hat\beta^{(\ell_0)}\|_0 + \eta^2\|\beta\|_0,$$

which leads to the first bound of the theorem. The second bound is a direct consequence of Theorem 1 since $\Delta(a, |S|;\lambda) = \lambda^2|S|/2$ by (13). □
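
The step from the last display to the first bound is the following rearrangement, valid for $\eta \in (0, 1)$:

```latex
\begin{align*}
\|\hat\beta^{(\ell_0)}\|_0 - \|\beta\|_0
  &\le \eta^2\|\hat\beta^{(\ell_0)}\|_0 + \eta^2\|\beta\|_0 \\
\Longrightarrow\quad (1-\eta^2)\,\|\hat\beta^{(\ell_0)}\|_0
  &\le (1+\eta^2)\,\|\beta\|_0 \\
\Longrightarrow\quad \|\hat\beta^{(\ell_0)}\|_0
  &\le \frac{1+\eta^2}{1-\eta^2}\,\|\beta\|_0 .
\end{align*}
```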

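5.7 Proof of Theorem 4

For simplicity, let $\hat\beta = \hat\beta^{(\ell_0)}$, $\hat S = \mathrm{supp}(\hat\beta)$, and $S = \mathrm{supp}(\beta)$. We know that $\|\hat\beta\|_0 \le (1 + \eta^2)/(1 - \eta^2)\|\beta\|_0$ and thus $\|\hat\beta - \hat\beta^o\|_0 \le s$. Similar to the proof of Theorem 3, we have

$$\begin{aligned}
0 &\ge \|X(\hat\beta - \hat\beta^o)\|_2^2 + 2(X\hat\beta^o - y)^TX(\hat\beta - \hat\beta^o) + \lambda^2n[\|\hat\beta\|_0 - \|\hat\beta^o\|_0] \\
&\ge \kappa_-(s)n\|\hat\beta - \hat\beta^o\|_2^2 - \sqrt{2\kappa_-(s)}\,\lambda n\|(\hat\beta - \hat\beta^o)_{\hat S - S}\|_1 + \lambda^2n[\|\hat\beta\|_0 - \|\hat\beta^o\|_0] \\
&\ge \kappa_-(s)n\|(\hat\beta - \hat\beta^o)_S\|_2^2 + \kappa_-(s)n\|(\hat\beta - \hat\beta^o)_{\hat S - S}\|_2^2 - 2\sqrt{0.5\lambda^2n|\hat S - S|}\sqrt{\kappa_-(s)n}\,\|(\hat\beta - \hat\beta^o)_{\hat S - S}\|_2 \\
&\qquad + \lambda^2n[\|\hat\beta\|_0 - \|\hat\beta^o\|_0] \\
&\ge \kappa_-(s)n\|(\hat\beta^o)_{S - \hat S}\|_2^2 - 0.5\lambda^2n|\hat S - S| + \lambda^2n[\|\hat\beta\|_0 - \|\hat\beta^o\|_0] \\
&\ge 2\lambda^2n(|S - \hat S| - \delta^o) - 0.5\lambda^2n|\hat S - S| + \lambda^2n[\|\hat\beta\|_0 - \|\hat\beta^o\|_0] \\
&\ge \lambda^2n(|S - \hat S| + 0.5|\hat S - S| - 2\delta^o).
\end{aligned}$$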
The first inequality uses the same derivation of a similar result in the proof of Theorem 3. The second inequality uses the assumption of the theorem, $(P_S\varepsilon - \varepsilon)^TX = (X\hat\beta^o - y)^TX$, and the fact that $(X\hat\beta^o - y)^TX_S = 0$. The fourth inequality uses $b^2 - 2ab \ge -a^2$ and $\|(\hat\beta - \hat\beta^o)_S\|_2 \ge \|(\hat\beta^o)_{S - \hat S}\|_2$. The fifth inequality uses

$$\kappa_-(s)n\|(\hat\beta^o)_{S - \hat S}\|_2^2 \ge \kappa_-(s)n\sum_{j \in S - \hat S;\ (\hat\beta^o_j)^2 \ge 2\lambda^2/\kappa_-(s)}(\hat\beta^o_j)^2 \ge 2\lambda^2n\,\#\{j \in S - \hat S : (\hat\beta^o_j)^2 \ge 2\lambda^2/\kappa_-(s)\} \ge 2\lambda^2n(|S - \hat S| - \delta^o).$$

The last inequality uses the derivation $\|\hat\beta\|_0 - \|\hat\beta^o\|_0 \ge |\hat S| - |S| = |\hat S - S| - |S - \hat S|$ and simple algebra. This proves the first desired bound. Similarly, we have

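$$\begin{aligned}
0 &\ge \|X(\hat\beta - \hat\beta^o)\|_2^2 - \sqrt{2\kappa_-(s)}\,\lambda n\|(\hat\beta - \hat\beta^o)_{\hat S - S}\|_1 + \lambda^2n[\|\hat\beta\|_0 - \|\hat\beta^o\|_0] \\
&\ge 0.5\|X(\hat\beta - \hat\beta^o)\|_2^2 + 0.5\kappa_-(s)n\|(\hat\beta - \hat\beta^o)_S\|_2^2 + 0.5\kappa_-(s)n\|(\hat\beta - \hat\beta^o)_{\hat S - S}\|_2^2 \\
&\qquad - \sqrt{2\kappa_-(s)|\hat S - S|}\,\lambda n\|(\hat\beta - \hat\beta^o)_{\hat S - S}\|_2 + \lambda^2n[\|\hat\beta\|_0 - \|\hat\beta^o\|_0] \\
&\ge 0.5\|X(\hat\beta - \hat\beta^o)\|_2^2 + 0.5\kappa_-(s)n\|(\hat\beta^o)_{S - \hat S}\|_2^2 - \lambda^2n|\hat S - S| + \lambda^2n[\|\hat\beta\|_0 - \|\hat\beta^o\|_0] \\
&\ge 0.5\|X(\hat\beta - \hat\beta^o)\|_2^2 + \lambda^2n(|S - \hat S| - \delta^o) - \lambda^2n|\hat S - S| + \lambda^2n[\|\hat\beta\|_0 - \|\hat\beta^o\|_0] \\
&\ge 0.5\|X(\hat\beta - \hat\beta^o)\|_2^2 - \lambda^2n\delta^o.
\end{aligned}$$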

The second inequality uses the definition of $\kappa_-(s)$. The third inequality uses $0.5b^2 - \sqrt{2}ab \ge -a^2$ and $\|(\hat\beta - \hat\beta^o)_S\|_2 \ge \|(\hat\beta^o)_{S - \hat S}\|_2$. The fourth inequality uses the previously derived inequality $\kappa_-(s)n\|(\hat\beta^o)_{S - \hat S}\|_2^2 \ge 2\lambda^2n(|S - \hat S| - \delta^o)$. The last inequality uses the derivation $\|\hat\beta\|_0 - \|\hat\beta^o\|_0 \ge |\hat S| - |S| = |\hat S - S| - |S - \hat S|$ and simple algebra. This leads to the second desired bound. □
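
The two elementary inequalities invoked in this proof both follow by completing the square, for any real $a$ and $b$:

```latex
\begin{align*}
b^2 - 2ab &= (b - a)^2 - a^2 \;\ge\; -a^2, \\
0.5\,b^2 - \sqrt{2}\,ab &= 0.5\,(b - \sqrt{2}\,a)^2 - a^2 \;\ge\; -a^2 .
\end{align*}
```

In the first chain they are applied with $b = \sqrt{\kappa_-(s)n}\,\|(\hat\beta - \hat\beta^o)_{\hat S - S}\|_2$ and $a = \sqrt{0.5\lambda^2n|\hat S - S|}$, and in the second chain with the same $b$ and $a = \lambda\sqrt{n|\hat S - S|}$.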

5.8 Proof of Theorem 5

Since (3^ are approximate local solutions with excess (|25p gives 

\\X^XA/n + p(3 (1) ; A) - p(3 (2) ; A)|| 2 < (u^) 1 ' 2 + (v^) 1 / 2 < yfr. 
Let E = fiK 1 ) U S^ 2 \ Since \E\ < m + k and « < K_(m + fc), it follows that 

||XA|| 2 /n < - A T (p(3 (1) ; A) - p(3 (2) ; A)) + v^||A|| 2 
<«||A|| 2 + |(A T 0(|3 (1) U)|+^||A|| 2 

< K ||A|| 2 + f||0(|3?U)ll2 + ^)||A|| 2 



Since \\A\\l < ||XA|||/{rwc_(m + k)}, ([26]) follows. 

Let Ex ■= {j ■ iffi-P^l > A /^K-(m + fc)}. We have A§|^i| < K-{m+k)\\ A||| < ||XA|||/n. 
Since j G S\S^ implies ffP - fif = pf \ ((27} follows. 

Let £ 2 := \ S and A(, = p(0+; A) - ||X^ c (x3 (1) - y)/n||oo- For j G E 2 , 

(2) X „T/ V ^) 



A' < p(0+;A)+sgn(/3 w )^(^ -y)/n 
< {p( +;A) -sgn(/3 l ) ) P (pf ) ;\)} + \x](Xf3 ( ' - y)/n + p0f; A)| + |ajJXA/n|. 
Since 0(0+, k) = means p(0+;A) - sga(t)p(t; A) = p(0+;A) -p(|t|;A) < n\t\ for t ^ and 

\E 2 \X' < K ||A £2 || 1 + ||Xj 2 (X^ (2) -y)/n + / 6(^g;A)|| 1 + ||Xj 2 XA/n|| 1 
< vT^{ K ll A ll2 + ||Xj 2 XA/n|| 2 } 

Since || A||| < ||JCA|||/{n«_(m + fc)} and ||Xj 2 XA/rt||| < K + (m)||XA|||/n ) ® follows. □ 

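5.9 Proof of Theorem 6

We note that $X^T(y - X\hat\beta^o) = X^T(\varepsilon - P_S\varepsilon) = X^TP_S^\perp\varepsilon$.

(i) Since $x_j^T(X\hat\beta^o - y)/n + \dot\rho(\hat\beta^o_j;\lambda) = \dot\rho(\hat\beta^o_j;\lambda)$ for $j \in S$ and $x_j^T(X\hat\beta^o - y)/n + \dot\rho(\hat\beta^o_j;\lambda) = 0$ for $j \notin S$, $\nu = \|\dot\rho(\hat\beta^o_S;\lambda)\|_2^2$. Let $\tilde\beta^{(2)}$ be the global solution of (2). Under the additional conditions, $\tilde\beta^{(2)} = \hat\beta^o$ by Theorem 2 and (26) with $\tilde\beta^{(1)} = \hat\beta^o$.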

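(ii) Since $\min_{j\in S}|\hat\beta^o_j| > \theta_1\lambda^*$, the map $b_S \to \hat\beta^o_S - (X_S^TX_S/n)^{-1}\dot\rho(b_S;\lambda)$ is continuous and closed in the rectangle $B = \{v : \|v_S - \hat\beta^o_S\|_\infty \le \theta_1\lambda^*,\ v_{S^c} = 0\}$. Thus, the Brouwer fixed point theorem implies a $\tilde\beta \in B$ satisfying $\mathrm{sgn}(\tilde\beta) = \mathrm{sgn}(\hat\beta^o)$ and

$$X_S^T(y - X\tilde\beta)/n = (X_S^TX_S/n)(\hat\beta^o - \tilde\beta)_S = \dot\rho(\tilde\beta_S;\lambda).$$

Since $y - X\tilde\beta = y - X\hat\beta^o - X_S(\tilde\beta - \hat\beta^o)_S = y - X\hat\beta^o + X_S(X_S^TX_S/n)^{-1}\dot\rho(\tilde\beta_S;\lambda)$,

$$\|X_{S^c}^T(y - X\tilde\beta)/n\|_\infty \le \|X_{S^c}^T(y - X\hat\beta^o)/n\|_\infty + \theta_2\lambda^* \le \lambda^*,$$

so that $\tilde\beta$ is a local solution of (2). The proof of global optimality of $\tilde\beta$ is the same as (i). □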
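
The fixed point in part (ii) can be illustrated with a simple iteration of the map $b_S \to \hat\beta^o_S - (X_S^TX_S/n)^{-1}\dot\rho(b_S;\lambda)$. The Python sketch below is only an illustration under assumptions made here: the Brouwer theorem guarantees existence of a fixed point, not convergence of this iteration, which is expected only when the map is a contraction; the MCP derivative is used as an example of $\dot\rho$, and the names `mcp_derivative` and `fixed_point_on_support` are hypothetical.

```python
import numpy as np

def mcp_derivative(b, lam, gamma):
    """Derivative of the MCP penalty at b != 0: sgn(b) * max(lam - |b|/gamma, 0)."""
    return np.sign(b) * np.maximum(lam - np.abs(b) / gamma, 0.0)

def fixed_point_on_support(gram_S, beta_oracle_S, lam, gamma, iters=200):
    """Iterate b <- beta_oracle_S - gram_S^{-1} rho'(b), with gram_S = X_S^T X_S / n."""
    b = beta_oracle_S.copy()
    for _ in range(iters):
        b = beta_oracle_S - np.linalg.solve(gram_S, mcp_derivative(b, lam, gamma))
    return b

# Toy case with an identity Gram matrix, where the map is a contraction with rate 1/gamma.
gram_S = np.eye(2)
beta_oracle_S = np.array([2.0, -2.0])
b = fixed_point_on_support(gram_S, beta_oracle_S, lam=1.0, gamma=3.0)
print(b)
```

In this toy case the limit satisfies $(X_S^TX_S/n)(\hat\beta^o - \tilde\beta)_S = \dot\rho(\tilde\beta_S;\lambda)$ and keeps the signs of $\hat\beta^o_S$, as in the displayed identity of the proof.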

5.10 Proof of Theorem 7

The proof is similar to that of Theorem 2. As intermediate results, we will prove lemmas that are analogous to Lemma 1 and Theorem 1. In the following, we assume that the conditions of the theorem hold. We also let $\Delta = \tilde\beta - \beta$.

Lemma 3. Let $\lambda_1^* := \sup_{t>0}|\dot\rho(t;\lambda)|$. We have $\|X^T(y - X\tilde\beta)/n\|_\infty \le \lambda_1^*$.

Proof. A local solution satisfies $|x_j^T(X\tilde\beta - y)/n| = |\dot\rho(\tilde\beta_j;\lambda)| \le \lambda_1^*$ for all $j$. □
Lemma 4. We have $\|X\Delta\|_2^2/(2n) + \|\rho(\Delta_{S^c};\lambda)\|_1 \le b$ with $\Delta = \tilde\beta - \beta$.

Proof. We consider two situations: the first is $\|\rho(\Delta_S;\lambda)\|_1 \le \nu$, and the second is $\|\rho(\Delta_S;\lambda)\|_1 > \nu$. In the first situation, we obtain directly from Lemma 2 that

$$\|X\Delta\|_2^2/(2n) + \|\rho(\Delta_{S^c};\lambda)\|_1 \le 2\nu/(1 - \eta) = \xi'\nu.$$

In the second situation, we obtain from Lemma 2 that $\|X\Delta\|_2^2/(2n) + \|\rho(\Delta_{S^c};\lambda)\|_1 \le \xi'\|\rho(\Delta_S;\lambda)\|_1$. Therefore the definition of $\mathrm{RIF}_1(\xi', S)$ gives $\|\Delta\|_1 \le \|X^TX\Delta\|_\infty|S|/\{n\,\mathrm{RIF}_1(\xi', S)\}$. It follows from Lemma 3 that $\|X^T(y - X\tilde\beta)/n\|_\infty \le \lambda_1^*$. Similarly, $\|X^T(\varepsilon/\eta)/n\|_\infty \le \lambda^* = \lambda$ due to (17). Since $\lambda = \lambda^* \le \inf_t|\rho(t;\lambda)/t| \le \lambda_1^*$, we have $\|X^TX\Delta/n\|_\infty = \|X^T(y - X\tilde\beta - \varepsilon)/n\|_\infty \le (1 + \eta)\lambda_1^*$. This implies that $\|\Delta\|_1 \le a_1'\lambda_1^*|S|$, where $a_1' = (1 + \eta)/\mathrm{RIF}_1(\xi', S)$. This can be combined with Lemma 2 and the definition of $\Delta(a, |S|;\lambda)$ to obtain

$$\|X\Delta\|_2^2/(2n) + \|\rho(\Delta_{S^c};\lambda)\|_1 \le \xi'\Delta(a_1'\lambda_1^*, |S|;\lambda).$$

Combining the two situations, we obtain the lemma. □

We are now ready to prove the theorem. 

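(i) Let $\Delta^{(\ell_1)} = \hat\beta^{(\ell_1)} - \beta$. Since $|\varepsilon^TX\Delta^{(\ell_1)}/n| \le \|X\Delta^{(\ell_1)}\|_2^2/(2n) + \|\rho(\Delta^{(\ell_1)};\lambda)\|_1$ by the null consistency condition (17),

$$\begin{aligned}
\nu &= \|X\Delta^{(\ell_1)}\|_2^2/(2n) - \varepsilon^TX\Delta^{(\ell_1)}/n + \|\rho(\hat\beta^{(\ell_1)};\lambda)\|_1 - \|\rho(\beta;\lambda)\|_1 \\
&\le 2\{\|X\Delta^{(\ell_1)}\|_2^2/(2n) + \|\rho(\Delta^{(\ell_1)};\lambda)\|_1\} \\
&\le 2\{\xi a_1\lambda^2|S| + \Delta(a_1\lambda|S|/m, m;\lambda)\} \\
&\le 2\{\xi a_1\lambda^2|S| + \lambda^2\max(a_1|S|, 2m)\} = O(\lambda^2|S|).
\end{aligned}$$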

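(ii) Let $S_1 = \{j \in \tilde S \setminus S : |\tilde\beta_j| \ge t_0\}$, $S_2 = \{j \in \tilde S \setminus S : |\tilde\beta_j| < t_0\}$, and $\lambda_2 \ge \sqrt{2\kappa_+(m_0)b/m_0}$ satisfying $\lambda_2 + \|X^T\varepsilon/n\|_\infty < \inf_{0<s<t_0}\dot\rho(s;\lambda)$. Just as in the proof of Theorem 2, we have $|S_1| \le \|\rho(\Delta_{S^c};\lambda)\|_1/\rho(t_0;\lambda)$, and for any $A \subset S_2$ with $|A| \le m_0$, $|A| < \kappa_+(m_0)\|X\Delta\|_2^2/(n\lambda_2^2)$. We apply Lemma 4 to obtain $|S_1| \le b/\rho(t_0;\lambda)$ and $|A| < 2\kappa_+(m_0)b/\lambda_2^2 \le m_0$. Thus, $\max_{A\subset S_2,\,|A|\le m_0}|A| < m_0$, which implies that $|S_2| < m_0$. The theorem follows. □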